Chunk Text
Split text or document elements into chunks optimized for embedding models and vector databases. Essential for RAG (Retrieval-Augmented Generation) workflows where documents need to be broken into searchable segments.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If Continue On Error is set to true, errors are not caught when the project is executed, even if a Catch node is used.
Inputs
- Text or Elements - Can accept either:
- Plain text string (from Extract Text or any source)
- Elements array (from Read Document node)
Options
- Chunking Strategy - How to split the content. Default is "Basic (Size-based)".
- Basic (Size-based) - Splits text based purely on size, trying to break at sentence boundaries
- By Title (Semantic sections) - Preserves document structure by keeping sections with their titles together
- Max Characters - Maximum characters per chunk. Default is 1000. Recommended ranges:
- 500-800 for short-form content
- 800-1200 for general documents
- 1200-2000 for technical documents with context requirements
- Overlap - Number of characters to overlap between chunks. Default is 200. This helps maintain context across chunk boundaries.
- Combine Under N Chars - (By Title strategy only) Merge small sections under this size with adjacent sections. Default is 500. Prevents tiny chunks from single sentences.
Output
- Chunks - Array of chunk objects, each containing:
- text - The chunk content
- index - Chunk position (0-based)
- metadata - (if available) Information from the original elements:
  - page - Page number where the chunk content appears
  - elementCount - Number of original elements combined into this chunk
- Chunk Count - Number of chunks created.
Chunking Strategies Explained
Basic Strategy (Size-based)
Simple chunking that splits text when it exceeds the max character limit (see the sketch after the example below). The algorithm:
- Splits text at the max character boundary
- Attempts to break at paragraph boundaries (\n\n)
- Falls back to sentence boundaries (., !, ?)
- Creates overlap between chunks for context preservation
Best for:
- Unstructured text without clear sections
- Simple documents or articles
- Fast processing needs
- Content without meaningful headings
Example:
Input: "Long document text without structure..."
Output: [
{ text: "First 1000 chars...", index: 0 },
{ text: "...overlap... next 1000 chars...", index: 1 }
]
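For illustration, a minimal sketch of size-based chunking with overlap; this is not the node's actual source, and the break-point heuristics are simplified:
// Minimal sketch of size-based chunking with overlap (illustrative only)
function basicChunk(text, maxChars = 1000, overlap = 200) {
  const chunks = [];
  let start = 0, index = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      // Prefer a paragraph break, then a sentence break, within the window
      const slice = text.slice(start, end);
      const para = slice.lastIndexOf('\n\n');
      const sent = Math.max(slice.lastIndexOf('. '), slice.lastIndexOf('! '), slice.lastIndexOf('? '));
      const breakAt = para > 0 ? para : sent > 0 ? sent + 1 : slice.length;
      end = start + breakAt;
    }
    chunks.push({ text: text.slice(start, end).trim(), index: index++ });
    if (end >= text.length) break;
    start = Math.max(end - overlap, start + 1); // step back to create the overlap
  }
  return chunks;
}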
By Title Strategy (Semantic)
Intelligent chunking that respects document structure (see the sketch after the example below). The algorithm:
- Groups content by section titles and headings
- Keeps titles with their content
- Splits sections if they exceed max characters
- Combines small sections to avoid tiny chunks
- Preserves semantic boundaries
Best for:
- Structured documents with clear sections
- Technical documentation
- Books or long reports
- Maintaining topical coherence
Example:
Input: [
{ text: "Introduction", type: "Title" },
{ text: "This is the intro...", type: "NarrativeText" },
{ text: "Chapter 1", type: "Title" },
{ text: "Chapter content...", type: "NarrativeText" }
]
Output: [
{ text: "Introduction\n\nThis is the intro...", index: 0, metadata: { page: 1 } },
{ text: "Chapter 1\n\nChapter content...", index: 1, metadata: { page: 2 } }
]
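The grouping idea can be sketched as follows; the element shapes match the example above, and the real node additionally splits oversized sections and applies Combine Under N Chars:
// Sketch of title-based grouping (the node also splits/merges sections)
function chunkByTitle(elements) {
  const sections = [];
  let current = null;
  for (const el of elements) {
    if (el.type === 'Title' || !current) {
      if (current) sections.push(current);
      current = { text: el.text, metadata: el.metadata };
    } else {
      current.text += '\n\n' + el.text; // keep content under its title
    }
  }
  if (current) sections.push(current);
  return sections.map((s, index) => ({ ...s, index }));
}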
How It Works
The Chunk Text node processes your content based on the selected strategy. When executed, the node:
- Validates input (text string or elements array)
- Parses configuration options (max characters, overlap, etc.)
- For plain text: Applies basic chunking with sentence-aware splitting
- For elements array: Uses selected strategy (basic or by_title)
- Ensures no chunk exceeds max character limit
- Adds overlap between chunks for context continuity
- Preserves metadata from original elements when available
- Returns array of chunks with indices and metadata
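Put together, the validation and dispatch steps might look like this sketch (reusing the basicChunk and chunkByTitle sketches above; not the node's source):
// Sketch of input validation and strategy dispatch
function chunkInput(input, { strategy, maxChars, overlap }) {
  if (input == null || input.length === 0) {
    throw new Error('Text or elements input is required');
  }
  if (typeof input === 'string') {
    return basicChunk(input, maxChars, overlap); // plain text: sentence-aware basic chunking
  }
  if (Array.isArray(input)) {
    return strategy === 'by_title'
      ? chunkByTitle(input)
      : basicChunk(input.map(el => el.text).join('\n\n'), maxChars, overlap);
  }
  throw new Error('Input must be a text string or an array');
}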
Requirements
- Valid text string or elements array from Read Document
- Reasonable max characters value (at least 100 characters)
- Overlap should be less than max characters
Error Handling
The node will return specific errors in the following cases:
- Empty or missing input
- Invalid max characters (must be positive number)
- Invalid overlap (must be non-negative number)
- Input type is neither string nor array
Usage Examples
Example 1: Basic RAG Pipeline
// Read Document → Chunk Text (By Title) → Prepare for Embeddings
// Read Document output
msg.elements = [/* document elements */]
// Chunk Text output
msg.chunks = [
{
text: "Introduction\n\nMachine learning is a field of AI...",
index: 0,
metadata: { page: 1, elementCount: 3 }
},
{
text: "Supervised Learning\n\nIn supervised learning, we train models...",
index: 1,
metadata: { page: 2, elementCount: 5 }
}
]
msg.chunkCount = 2
Example 2: Plain Text Chunking
// Extract Text → Chunk Text (Basic)
msg.text = "Very long article text without structure...";
// Chunk Text with Max Characters: 800, Overlap: 150
msg.chunks = [
{ text: "First 800 characters of text...", index: 0 },
{ text: "...last 150 chars from chunk 0... next 650 chars...", index: 1 }
]
Example 3: Custom Chunk Sizes for Different Models
// For OpenAI text-embedding-3-small, targeting ~150-token chunks (~600 chars at ~4 chars/token)
// Set Max Characters: 600, Overlap: 100
// For longer context models
// Set Max Characters: 1500, Overlap: 200
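To derive a Max Characters value from a model's token budget, a rough conversion like this can help (the ~4 chars/token ratio is an English-text heuristic and varies by tokenizer):
// Heuristic conversion from a token budget to a Max Characters setting
function maxCharsForTokenBudget(tokenLimit, charsPerToken = 4, headroom = 0.85) {
  return Math.floor(tokenLimit * charsPerToken * headroom); // leave ~15% buffer
}
// e.g. maxCharsForTokenBudget(512) → 1740 characters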
Example 4: Processing Chunks Individually
// After Chunk Text
// Use Loop node to process each chunk
msg.chunks.forEach(chunk => {
console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 50)}...`);
console.log(` Page: ${chunk.metadata?.page || 'unknown'}`);
console.log(` Length: ${chunk.text.length} chars`);
});
Example 5: Filter Chunks by Content
// After Chunk Text
// Find chunks containing specific keywords
const relevantChunks = msg.chunks.filter(chunk =>
chunk.text.toLowerCase().includes('machine learning') ||
chunk.text.toLowerCase().includes('neural network')
);
msg.filteredChunks = relevantChunks;
msg.filteredCount = relevantChunks.length;
Example 6: Chunk Size Analysis
// After Chunk Text
const chunkStats = {
totalChunks: msg.chunks.length,
avgLength: Math.round(
msg.chunks.reduce((sum, c) => sum + c.text.length, 0) / msg.chunks.length
),
minLength: Math.min(...msg.chunks.map(c => c.text.length)),
maxLength: Math.max(...msg.chunks.map(c => c.text.length)),
chunksWithMetadata: msg.chunks.filter(c => c.metadata).length
};
msg.chunkStats = chunkStats;
Tips for Effective Use
- Choosing Chunk Size:
- Small chunks (300-500 chars): Better precision, more chunks, higher embedding costs
- Medium chunks (800-1200 chars): Balanced approach, recommended for most cases
- Large chunks (1500-2000 chars): More context, fewer chunks, may reduce precision
- Overlap Best Practices (a sizing sketch follows this list):
- Use 10-20% of chunk size as overlap (e.g., 200 chars for 1000 char chunks)
- Higher overlap = more context continuity but more redundancy
- Lower overlap = less redundancy but potential context loss at boundaries
- Strategy Selection:
- Use Basic for articles, blog posts, or unstructured content
- Use By Title for documentation, books, reports with clear sections
- By Title preserves topical coherence which improves retrieval accuracy
- Optimize for Your Embedding Model:
- Check your embedding model's token limit
- Account for ~4 characters per token (rough estimate)
- Leave buffer room: if the limit is 512 tokens, target 400-450 tokens (roughly 1600-1800 characters)
- Combine Under N Chars (By Title):
- Set to ~50% of max characters
- Prevents orphan chunks from single short paragraphs
- Maintains meaningful chunk sizes
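A small helper applying the 10-20% overlap rule above and estimating the resulting chunk count (a sketch built on this list's heuristics):
// Apply the 10-20% overlap rule and estimate chunk count for a document
function suggestChunking(docLength, maxChars = 1000, overlapRatio = 0.15) {
  const overlap = Math.round(maxChars * overlapRatio);
  const stride = maxChars - overlap; // net new characters per chunk
  const estimatedChunks = Math.max(1, Math.ceil(docLength / stride));
  return { maxChars, overlap, estimatedChunks };
}
// e.g. suggestChunking(50000) → { maxChars: 1000, overlap: 150, estimatedChunks: 59 }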
Common Errors and Solutions
Error: "Text or elements input is required"
Cause: No input was provided to the node. Solution: Connect either Extract Text output or Read Document elements output to this node.
Error: "Max Characters must be a positive number"
Cause: Invalid or zero max characters value. Solution: Set Max Characters to a positive number (recommended: 800-1200).
Error: "Overlap cannot be negative"
Cause: Negative overlap value was provided. Solution: Set Overlap to 0 or a positive number less than Max Characters.
Error: "Input must be a text string or an array"
Cause: Input is neither string nor array. Solution: Ensure input comes from Extract Text (string) or Read Document (array).
Warning: Too Many Chunks Created
Cause: Max Characters is very small or document is very large. Solution:
- Increase Max Characters to reduce chunk count
- Consider processing document in sections
- Check if this will impact embedding API costs
Issue: Chunks Too Small or Too Large
Cause: Suboptimal max characters or combine settings. Solution:
- Adjust Max Characters based on actual chunk lengths
- For By Title strategy, adjust Combine Under N Chars
- Review chunk size distribution with analysis code (Example 6)
Integration with Other Nodes
Complete RAG Ingestion Pipeline
Read Document → Chunk Text → Prepare for Embeddings → OpenAI Generate Batch Embeddings → Function (merge vectors) → LanceDB Add Data
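One hedged sketch of the Function (merge vectors) step, assuming embeddings arrive on msg.embeddings as an array parallel to msg.chunks and that msg.documentId is set upstream:
// Pair each chunk with its embedding before inserting into the vector database
msg.records = msg.chunks.map((chunk, i) => ({
  id: `${msg.documentId}_chunk_${chunk.index}`, // documentId assumed set upstream
  text: chunk.text,
  vector: msg.embeddings[i], // assumed parallel to msg.chunks
  page: chunk.metadata?.page ?? null
}));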
Simple Text to Vector Database
Extract Text → Chunk Text (Basic) → Prepare for Embeddings → Embeddings → Database
Multi-Document Processing
Loop (files) → Read Document → Chunk Text → Collect Chunks → Batch Process → Store
Quality Control Pipeline
Read Document → Chunk Text → Function (validate chunks) → Filter → Process
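The Function (validate chunks) step could, for instance, drop empty or undersized chunks before further processing (an illustrative sketch; the threshold is arbitrary):
// Drop empty or very short chunks before downstream processing
const MIN_CHUNK_LENGTH = 50; // arbitrary threshold for this example
msg.chunks = msg.chunks.filter(chunk =>
  chunk.text && chunk.text.trim().length >= MIN_CHUNK_LENGTH
);
msg.chunkCount = msg.chunks.length;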
Performance Considerations
- Speed: Basic strategy is faster than By Title strategy
- Memory: Chunks are stored in memory; very large documents may need batching
- Chunk Count: More chunks = more embedding API calls = higher cost
- Quality vs Cost: Larger chunks = fewer embeddings but may reduce search precision
Chunking Strategy Comparison
| Aspect | Basic (Size-based) | By Title (Semantic) |
|---|---|---|
| Speed | Faster | Slower (more processing) |
| Context | May break mid-topic | Preserves topical boundaries |
| Best For | Unstructured text | Structured documents |
| Chunk Quality | Variable | More consistent |
| Complexity | Simple | Intelligent |
| Input Required | String or elements | Elements (from Read Document) |
Best Practices for RAG
- Document Structure Matters:
- Use By Title strategy for structured documents (better retrieval)
- Use Basic strategy for unstructured content
- Overlap is Critical:
- Always use some overlap (150-200 chars) to maintain context
- Questions spanning chunk boundaries are better answered with overlap
- Chunk Size Optimization:
- Start with 1000 characters and adjust based on retrieval quality
- Monitor chunk size distribution
- Consider your typical query length when setting chunk size
- Metadata Preservation (see the citation sketch after this list):
- Chunks preserve page numbers when available
- Use metadata to cite sources in responses
- Track elementCount to understand chunk composition
- Testing and Iteration:
- Test retrieval quality with different chunk sizes
- Analyze failed retrievals to optimize chunking parameters
- Balance cost (chunk count) vs quality (chunk size)
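As referenced under Metadata Preservation, a sketch of building source citations from chunk metadata (msg.sourceFilePath is assumed set upstream, as in the hierarchical example below):
// Build a citation record for each chunk from its preserved metadata
msg.citations = msg.chunks.map(chunk => ({
  index: chunk.index,
  source: msg.sourceFilePath, // assumed set upstream
  page: chunk.metadata?.page ?? 'unknown'
}));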
Advanced Use Cases
Dynamic Chunk Size Based on Content Type
// Before Chunk Text, in a Function node
const contentType = msg.documentType;
if (contentType === 'code-documentation') {
msg.maxChars = 1500; // Larger for code examples
msg.overlap = 300;
} else if (contentType === 'news-article') {
msg.maxChars = 600; // Smaller for quick facts
msg.overlap = 100;
} else {
msg.maxChars = 1000; // Default
msg.overlap = 200;
}
Hierarchical Chunking
// After Chunk Text
// Create parent-child relationships
msg.chunks.forEach((chunk, index) => {
chunk.parentDoc = msg.sourceFilePath;
chunk.chunkId = `${msg.documentId}_chunk_${index}`;
// Link to previous/next chunks
if (index > 0) chunk.previousChunk = `${msg.documentId}_chunk_${index - 1}`;
if (index < msg.chunks.length - 1) chunk.nextChunk = `${msg.documentId}_chunk_${index + 1}`;
});