
Chunk Text

Split text or document elements into chunks optimized for embedding models and vector databases. Essential for RAG (Retrieval-Augmented Generation) workflows where documents need to be broken into searchable segments.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - The automation will continue regardless of any error. The default value is false.
Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • Text or Elements - Can accept either:
    • Plain text string (from Extract Text or any source)
    • Elements array (from Read Document node)

Options

  • Chunking Strategy - How to split the content. Default is "Basic (Size-based)".
    • Basic (Size-based) - Splits text based purely on size, trying to break at sentence boundaries
    • By Title (Semantic sections) - Preserves document structure by keeping sections with their titles together
  • Max Characters - Maximum characters per chunk. Default is 1000. Recommended ranges:
    • 500-800 for short-form content
    • 800-1200 for general documents
    • 1200-2000 for technical documents with context requirements
  • Overlap - Number of characters to overlap between chunks. Default is 200. This helps maintain context across chunk boundaries.
  • Combine Under N Chars - (By Title strategy only) Merge small sections under this size with adjacent sections. Default is 500. Prevents tiny chunks from single sentences.

Output

  • Chunks - Array of chunk objects, each containing:
    • text - The chunk content
    • index - Chunk position (0-based)
    • metadata - (if available) Information from original elements
      • page - Page number where chunk content appears
      • elementCount - Number of original elements combined into this chunk
  • Chunk Count - Number of chunks created.
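
For example, a downstream Function node can read these outputs directly (a minimal sketch; msg.chunks and msg.chunkCount follow the shape described above):

// After Chunk Text
const first = msg.chunks[0];
console.log(first.text);           // chunk content
console.log(first.index);          // 0
console.log(first.metadata?.page); // page number, when metadata is available
console.log(msg.chunkCount);       // total number of chunks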

Chunking Strategies Explained

Basic Strategy (Size-based)

Simple chunking that splits text when it exceeds the max character limit. The algorithm:

  1. Splits text at max character boundary
  2. Attempts to break at paragraph boundaries (\n\n)
  3. Falls back to sentence boundaries (., !, ?)
  4. Creates overlap between chunks for context preservation
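
A minimal JavaScript sketch of this algorithm (illustrative only, not the node's actual implementation):

// Size-based chunking with sentence-aware splitting and overlap (sketch)
function basicChunk(text, maxChars = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      const window = text.slice(start, end);
      // Prefer a paragraph break, then a sentence break, inside the window
      const para = window.lastIndexOf('\n\n');
      const sentence = Math.max(
        window.lastIndexOf('. '),
        window.lastIndexOf('! '),
        window.lastIndexOf('? ')
      );
      if (para > 0) end = start + para;
      else if (sentence > 0) end = start + sentence + 1;
    }
    chunks.push({ text: text.slice(start, end).trim(), index: chunks.length });
    if (end >= text.length) break;
    start = Math.max(end - overlap, start + 1); // step back to create overlap
  }
  return chunks;
}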

Best for:

  • Unstructured text without clear sections
  • Simple documents or articles
  • Fast processing needs
  • Content without meaningful headings

Example:

Input: "Long document text without structure..."
Output: [
  { text: "First 1000 chars...", index: 0 },
  { text: "...overlap... next 1000 chars...", index: 1 }
]

By Title Strategy (Semantic)

Intelligent chunking that respects document structure. The algorithm:

  1. Groups content by section titles and headings
  2. Keeps titles with their content
  3. Splits sections if they exceed max characters
  4. Combines small sections to avoid tiny chunks
  5. Preserves semantic boundaries
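
A simplified sketch of the grouping step (assumes elements shaped like the Read Document output shown below; the real node also applies Max Characters and Combine Under N Chars on top of this):

// Group elements into sections keyed by their titles (sketch)
function groupByTitle(elements) {
  const sections = [];
  let current = null;
  for (const el of elements) {
    if (el.type === 'Title') {
      current = { parts: [el.text], metadata: el.metadata };
      sections.push(current);
    } else if (current) {
      current.parts.push(el.text); // keep content with its title
    } else {
      // Content before the first title becomes its own section
      current = { parts: [el.text], metadata: el.metadata };
      sections.push(current);
    }
  }
  return sections.map((s, index) => ({
    text: s.parts.join('\n\n'),
    index,
    metadata: s.metadata,
  }));
}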

Best for:

  • Structured documents with clear sections
  • Technical documentation
  • Books or long reports
  • Maintaining topical coherence

Example:

Input: [
  { text: "Introduction", type: "Title" },
  { text: "This is the intro...", type: "NarrativeText" },
  { text: "Chapter 1", type: "Title" },
  { text: "Chapter content...", type: "NarrativeText" }
]
Output: [
  { text: "Introduction\n\nThis is the intro...", index: 0, metadata: { page: 1 } },
  { text: "Chapter 1\n\nChapter content...", index: 1, metadata: { page: 2 } }
]

How It Works

The Chunk Text node processes your content based on the selected strategy. When executed, the node:

  1. Validates input (text string or elements array)
  2. Parses configuration options (max characters, overlap, etc.)
  3. For plain text: Applies basic chunking with sentence-aware splitting
  4. For elements array: Uses selected strategy (basic or by_title)
  5. Ensures no chunk exceeds max character limit
  6. Adds overlap between chunks for context continuity
  7. Preserves metadata from original elements when available
  8. Returns array of chunks with indices and metadata
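
The size guarantee (step 5) is easy to sanity-check downstream, for example:

// After Chunk Text: verify no chunk exceeds the configured limit
const maxChars = 1000; // must match the node's Max Characters option
for (const chunk of msg.chunks) {
  if (chunk.text.length > maxChars) {
    console.warn(`Chunk ${chunk.index} exceeds ${maxChars} characters`);
  }
}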

Requirements

  • Valid text string or elements array from Read Document
  • Reasonable max characters value (at least 100 characters)
  • Overlap should be less than max characters

Error Handling

The node will return specific errors in the following cases:

  • Empty or missing input
  • Invalid max characters (must be positive number)
  • Invalid overlap (must be non-negative number)
  • Input type is neither string nor array
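
The checks behave roughly like this (an illustrative sketch, not the node's source):

// Sketch of the node's input validation
function validateChunkInput(input, maxChars, overlap) {
  if (input == null || input.length === 0) {
    throw new Error('Text or elements input is required');
  }
  if (typeof input !== 'string' && !Array.isArray(input)) {
    throw new Error('Input must be a text string or an array');
  }
  if (!(maxChars > 0)) {
    throw new Error('Max Characters must be a positive number');
  }
  if (overlap < 0) {
    throw new Error('Overlap cannot be negative');
  }
}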

Usage Examples

Example 1: Basic RAG Pipeline

// Read Document → Chunk Text (By Title) → Prepare for Embeddings

// Read Document output
msg.elements = [/* document elements */]

// Chunk Text output
msg.chunks = [
  {
    text: "Introduction\n\nMachine learning is a field of AI...",
    index: 0,
    metadata: { page: 1, elementCount: 3 }
  },
  {
    text: "Supervised Learning\n\nIn supervised learning, we train models...",
    index: 1,
    metadata: { page: 2, elementCount: 5 }
  }
]
msg.chunkCount = 2

Example 2: Plain Text Chunking

// Extract Text → Chunk Text (Basic)

msg.text = "Very long article text without structure...";

// Chunk Text with Max Characters: 800, Overlap: 150
msg.chunks = [
  { text: "First 800 characters of text...", index: 0 },
  { text: "...last 150 chars from chunk 0... next 650 chars...", index: 1 }
]

Example 3: Custom Chunk Sizes for Different Models

// For OpenAI text-embedding-3-small with small, precise chunks (600 chars ≈ 150 tokens)
// Set Max Characters: 600, Overlap: 100

// For longer context models
// Set Max Characters: 1500, Overlap: 200

Example 4: Processing Chunks Individually

// After Chunk Text
// Use Loop node to process each chunk

msg.chunks.forEach(chunk => {
  console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 50)}...`);
  console.log(`  Page: ${chunk.metadata?.page || 'unknown'}`);
  console.log(`  Length: ${chunk.text.length} chars`);
});

Example 5: Filter Chunks by Content

// After Chunk Text

// Find chunks containing specific keywords
const relevantChunks = msg.chunks.filter(chunk =>
  chunk.text.toLowerCase().includes('machine learning') ||
  chunk.text.toLowerCase().includes('neural network')
);

msg.filteredChunks = relevantChunks;
msg.filteredCount = relevantChunks.length;

Example 6: Chunk Size Analysis

// After Chunk Text

const chunkStats = {
  totalChunks: msg.chunks.length,
  avgLength: Math.round(
    msg.chunks.reduce((sum, c) => sum + c.text.length, 0) / msg.chunks.length
  ),
  minLength: Math.min(...msg.chunks.map(c => c.text.length)),
  maxLength: Math.max(...msg.chunks.map(c => c.text.length)),
  chunksWithMetadata: msg.chunks.filter(c => c.metadata).length
};

msg.chunkStats = chunkStats;

Tips for Effective Use

  1. Choosing Chunk Size:

    • Small chunks (300-500 chars): Better precision, more chunks, higher embedding costs
    • Medium chunks (800-1200 chars): Balanced approach, recommended for most cases
    • Large chunks (1500-2000 chars): More context, fewer chunks, may reduce precision
  2. Overlap Best Practices:

    • Use 10-20% of chunk size as overlap (e.g., 200 chars for 1000 char chunks)
    • Higher overlap = more context continuity but more redundancy
    • Lower overlap = less redundancy but potential context loss at boundaries
  3. Strategy Selection:

    • Use Basic for articles, blog posts, or unstructured content
    • Use By Title for documentation, books, reports with clear sections
    • By Title preserves topical coherence which improves retrieval accuracy
  4. Optimize for Your Embedding Model:

    • Check your embedding model's token limit
    • Account for ~4 characters per token (rough estimate)
    • Leave buffer room: if the limit is 512 tokens, target 400-450 tokens (roughly 1600-1800 characters); see the sketch after this list
  5. Combine Under N Chars (By Title):

    • Set to ~50% of max characters
    • Prevents orphan chunks from single short paragraphs
    • Maintains meaningful chunk sizes
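
A small helper for the token estimate in tip 4 (the 4-characters-per-token ratio is a rough heuristic, not an exact tokenizer):

// Rough token estimate; use your model's tokenizer for exact counts
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Pick a Max Characters value that leaves buffer under a token limit
function maxCharsForTokenLimit(tokenLimit, bufferRatio = 0.85) {
  return Math.floor(tokenLimit * bufferRatio * 4);
}

console.log(maxCharsForTokenLimit(512)); // 1740 characters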

Common Errors and Solutions

Error: "Text or elements input is required"

Cause: No input was provided to the node.
Solution: Connect either Extract Text output or Read Document elements output to this node.

Error: "Max Characters must be a positive number"

Cause: Invalid or zero max characters value.
Solution: Set Max Characters to a positive number (recommended: 800-1200).

Error: "Overlap cannot be negative"

Cause: Negative overlap value was provided.
Solution: Set Overlap to 0 or a positive number less than Max Characters.

Error: "Input must be a text string or an array"

Cause: Input is neither a string nor an array.
Solution: Ensure input comes from Extract Text (string) or Read Document (array).

Warning: Too Many Chunks Created

Cause: Max Characters is very small or the document is very large.

Solution:

  • Increase Max Characters to reduce chunk count
  • Consider processing document in sections
  • Check if this will impact embedding API costs
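
For example, you can guard against runaway chunk counts before calling an embedding API (the threshold here is an arbitrary example):

// After Chunk Text: warn before sending a very large batch downstream
const MAX_EXPECTED_CHUNKS = 500; // arbitrary threshold, tune for your budget
if (msg.chunkCount > MAX_EXPECTED_CHUNKS) {
  console.warn(
    `Chunk count ${msg.chunkCount} exceeds ${MAX_EXPECTED_CHUNKS}; ` +
    'consider increasing Max Characters or splitting the document.'
  );
}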

Issue: Chunks Too Small or Too Large

Cause: Suboptimal max characters or combine settings.

Solution:

  • Adjust Max Characters based on actual chunk lengths
  • For By Title strategy, adjust Combine Under N Chars
  • Review chunk size distribution with analysis code (Example 6)

Integration with Other Nodes

Complete RAG Ingestion Pipeline

Read Document → Chunk Text → Prepare for Embeddings → OpenAI Generate Batch Embeddings → Function (merge vectors) → LanceDB Add Data

Simple Text to Vector Database

Extract Text → Chunk Text (Basic) → Prepare for Embeddings → Embeddings → Database

Multi-Document Processing

Loop (files) → Read Document → Chunk Text → Collect Chunks → Batch Process → Store

Quality Control Pipeline

Read Document → Chunk Text → Function (validate chunks) → Filter → Process
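
A possible body for the validate-chunks Function step (illustrative; the minimum length is an arbitrary example):

// Drop empty or near-empty chunks before further processing
const MIN_CHUNK_LENGTH = 50; // arbitrary example threshold
msg.chunks = msg.chunks.filter(chunk =>
  chunk.text && chunk.text.trim().length >= MIN_CHUNK_LENGTH
);
msg.chunkCount = msg.chunks.length;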

Performance Considerations

  • Speed: Basic strategy is faster than By Title strategy
  • Memory: Chunks are stored in memory; very large documents may need batching
  • Chunk Count: More chunks = more embedding API calls = higher cost
  • Quality vs Cost: Larger chunks = fewer embeddings but may reduce search precision

Chunking Strategy Comparison

Aspect         | Basic (Size-based)  | By Title (Semantic)
---------------|---------------------|------------------------------
Speed          | Faster              | Slower (more processing)
Context        | May break mid-topic | Preserves topical boundaries
Best For       | Unstructured text   | Structured documents
Chunk Quality  | Variable            | More consistent
Complexity     | Simple              | Intelligent
Input Required | String or elements  | Elements (from Read Document)

Best Practices for RAG

  1. Document Structure Matters:

    • Use By Title strategy for structured documents (better retrieval)
    • Use Basic strategy for unstructured content
  2. Overlap is Critical:

    • Always use some overlap (150-200 chars) to maintain context
    • Questions spanning chunk boundaries are better answered with overlap
  3. Chunk Size Optimization:

    • Start with 1000 characters and adjust based on retrieval quality
    • Monitor chunk size distribution
    • Consider your typical query length when setting chunk size
  4. Metadata Preservation:

    • Chunks preserve page numbers when available
    • Use metadata to cite sources in responses (see the sketch after this list)
    • Track elementCount to understand chunk composition
  5. Testing and Iteration:

    • Test retrieval quality with different chunk sizes
    • Analyze failed retrievals to optimize chunking parameters
    • Balance cost (chunk count) vs quality (chunk size)
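
For instance, chunk metadata can drive simple source citations (a sketch using the metadata fields documented above; msg.documentName is a hypothetical field):

// Build a citation string for each chunk
msg.citations = msg.chunks.map(chunk => {
  const page = chunk.metadata?.page;
  return page != null
    ? `${msg.documentName}, page ${page}`
    : msg.documentName;
});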

Advanced Use Cases

Dynamic Chunk Size Based on Content Type

// Before Chunk Text, in a Function node
const contentType = msg.documentType;

if (contentType === 'code-documentation') {
  msg.maxChars = 1500; // Larger for code examples
  msg.overlap = 300;
} else if (contentType === 'news-article') {
  msg.maxChars = 600; // Smaller for quick facts
  msg.overlap = 100;
} else {
  msg.maxChars = 1000; // Default
  msg.overlap = 200;
}

Hierarchical Chunking

// After Chunk Text
// Create parent-child relationships

msg.chunks.forEach((chunk, index) => {
  chunk.parentDoc = msg.sourceFilePath;
  chunk.chunkId = `${msg.documentId}_chunk_${index}`;

  // Link to previous/next chunks
  if (index > 0) chunk.previousChunk = `${msg.documentId}_chunk_${index - 1}`;
  if (index < msg.chunks.length - 1) chunk.nextChunk = `${msg.documentId}_chunk_${index + 1}`;
});
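
Linking neighboring chunks this way lets a retrieval step fetch the previous and next chunks of a match to expand context before generation.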