
Chunk Text

Split text or document elements into chunks optimized for embedding models and vector databases. Essential for RAG (Retrieval-Augmented Generation) workflows where documents need to be broken into searchable segments.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - The automation will continue regardless of any error. The default value is false.
Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • Text or Elements - Can accept either:
    • Plain text string (from Extract Text or any source)
    • Elements array (from Read Document node)

Options

  • Chunking Strategy - How to split the content. Default is "Basic (Size-based)".
    • Basic (Size-based) - Splits text based purely on size, trying to break at sentence boundaries
    • By Title (Semantic sections) - Preserves document structure by keeping sections with their titles together
  • Max Characters - Maximum characters per chunk. Default is 1000. Recommended ranges:
    • 500-800 for short-form content
    • 800-1200 for general documents
    • 1200-2000 for technical documents with context requirements
  • Overlap - Number of characters to overlap between chunks. Default is 200. This helps maintain context across chunk boundaries.
  • Combine Under N Chars - (By Title strategy only) Merge small sections under this size with adjacent sections. Default is 500. Prevents tiny chunks from single sentences.

Output

  • Chunks - Array of chunk objects, each containing:
    • text - The chunk content
    • index - Chunk position (0-based)
    • metadata - (if available) Information from original elements
      • page - Page number where chunk content appears
      • elementCount - Number of original elements combined into this chunk
  • Chunk Count - Number of chunks created.
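
For example, a downstream Function node can read these outputs directly (a minimal sketch; msg.chunks and msg.chunkCount follow the shape described above):

// After Chunk Text
const first = msg.chunks[0];
console.log(first.text);           // chunk content
console.log(first.index);          // 0
console.log(first.metadata?.page); // page number, when metadata is available
console.log(msg.chunkCount);       // total number of chunks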

Chunking Strategies Explained

Basic Strategy (Size-based)

Simple chunking that splits text when it exceeds the max character limit. The algorithm:

  1. Splits text at max character boundary
  2. Attempts to break at paragraph boundaries (\n\n)
  3. Falls back to sentence boundaries (., !, ?)
  4. Creates overlap between chunks for context preservation
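
A minimal JavaScript sketch of this algorithm (illustrative only, not the node's actual implementation):

// Size-based chunking with sentence-aware splitting and overlap (sketch)
function basicChunk(text, maxChars = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      const window = text.slice(start, end);
      // Prefer a paragraph break, then a sentence break, inside the window
      const para = window.lastIndexOf('\n\n');
      const sentence = Math.max(
        window.lastIndexOf('. '),
        window.lastIndexOf('! '),
        window.lastIndexOf('? ')
      );
      if (para > 0) end = start + para;
      else if (sentence > 0) end = start + sentence + 1;
    }
    chunks.push({ text: text.slice(start, end).trim(), index: chunks.length });
    if (end >= text.length) break;
    start = Math.max(end - overlap, start + 1); // step back to create overlap
  }
  return chunks;
}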

Best for:

  • Unstructured text without clear sections
  • Simple documents or articles
  • Fast processing needs
  • Content without meaningful headings

Example:

Input: "Long document text without structure..."
Output: [
  { text: "First 1000 chars...", index: 0 },
  { text: "...overlap... next 1000 chars...", index: 1 }
]

By Title Strategy (Semantic)

Intelligent chunking that respects document structure. The algorithm:

  1. Groups content by section titles and headings
  2. Keeps titles with their content
  3. Splits sections if they exceed max characters
  4. Combines small sections to avoid tiny chunks
  5. Preserves semantic boundaries
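
A simplified sketch of the grouping step (assumes elements shaped like the Read Document output shown below; the real node also applies Max Characters and Combine Under N Chars on top of this):

// Group elements into sections keyed by their titles (sketch)
function groupByTitle(elements) {
  const sections = [];
  let current = null;
  for (const el of elements) {
    if (el.type === 'Title') {
      current = { parts: [el.text], metadata: el.metadata };
      sections.push(current);
    } else if (current) {
      current.parts.push(el.text); // keep content with its title
    } else {
      // Content before the first title becomes its own section
      current = { parts: [el.text], metadata: el.metadata };
      sections.push(current);
    }
  }
  return sections.map((s, index) => ({
    text: s.parts.join('\n\n'),
    index,
    metadata: s.metadata,
  }));
}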

Best for:

  • Structured documents with clear sections
  • Technical documentation
  • Books or long reports
  • Maintaining topical coherence

Example:

Input: [
  { text: "Introduction", type: "Title" },
  { text: "This is the intro...", type: "NarrativeText" },
  { text: "Chapter 1", type: "Title" },
  { text: "Chapter content...", type: "NarrativeText" }
]
Output: [
  { text: "Introduction\n\nThis is the intro...", index: 0, metadata: { page: 1 } },
  { text: "Chapter 1\n\nChapter content...", index: 1, metadata: { page: 2 } }
]

How It Works

The Chunk Text node processes your content based on the selected strategy. When executed, the node:

  1. Validates input (text string or elements array)
  2. Parses configuration options (max characters, overlap, etc.)
  3. For plain text: Applies basic chunking with sentence-aware splitting
  4. For elements array: Uses selected strategy (basic or by_title)
  5. Ensures no chunk exceeds max character limit
  6. Adds overlap between chunks for context continuity
  7. Preserves metadata from original elements when available
  8. Returns array of chunks with indices and metadata
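
The size guarantee (step 5) is easy to sanity-check downstream, for example:

// After Chunk Text: verify no chunk exceeds the configured limit
const maxChars = 1000; // must match the node's Max Characters option
for (const chunk of msg.chunks) {
  if (chunk.text.length > maxChars) {
    console.warn(`Chunk ${chunk.index} exceeds ${maxChars} characters`);
  }
}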

Requirements

  • Valid text string or elements array from Read Document
  • Reasonable max characters value (at least 100 characters)
  • Overlap should be less than max characters

Error Handling

The node will return specific errors in the following cases:

  • Empty or missing input
  • Invalid max characters (must be positive number)
  • Invalid overlap (must be non-negative number)
  • Input type is neither string nor array
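
The checks behave roughly like this (an illustrative sketch, not the node's source):

// Sketch of the node's input validation
function validateChunkInput(input, maxChars, overlap) {
  if (input == null || input.length === 0) {
    throw new Error('Text or elements input is required');
  }
  if (typeof input !== 'string' && !Array.isArray(input)) {
    throw new Error('Input must be a text string or an array');
  }
  if (!(maxChars > 0)) {
    throw new Error('Max Characters must be a positive number');
  }
  if (overlap < 0) {
    throw new Error('Overlap cannot be negative');
  }
}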

Usage Examples

Example 1: Basic RAG Pipeline

// Read Document → Chunk Text (By Title) → Prepare for Embeddings

// Read Document output
msg.elements = [/* document elements */]

// Chunk Text output
msg.chunks = [
  {
    text: "Introduction\n\nMachine learning is a field of AI...",
    index: 0,
    metadata: { page: 1, elementCount: 3 }
  },
  {
    text: "Supervised Learning\n\nIn supervised learning, we train models...",
    index: 1,
    metadata: { page: 2, elementCount: 5 }
  }
]
msg.chunkCount = 2

Example 2: Plain Text Chunking

// Extract Text → Chunk Text (Basic)

msg.text = "Very long article text without structure...";

// Chunk Text with Max Characters: 800, Overlap: 150
msg.chunks = [
  { text: "First 800 characters of text...", index: 0 },
  { text: "...last 150 chars from chunk 0... next 650 chars...", index: 1 }
]

Example 3: Custom Chunk Sizes for Different Models

// For OpenAI text-embedding-3-small with small, precise chunks (600 chars ≈ 150 tokens)
// Set Max Characters: 600, Overlap: 100

// For longer context models
// Set Max Characters: 1500, Overlap: 200

Example 4: Processing Chunks Individually

// After Chunk Text
// Use Loop node to process each chunk

msg.chunks.forEach(chunk => {
  console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 50)}...`);
  console.log(`  Page: ${chunk.metadata?.page || 'unknown'}`);
  console.log(`  Length: ${chunk.text.length} chars`);
});

Example 5: Filter Chunks by Content

// After Chunk Text

// Find chunks containing specific keywords
const relevantChunks = msg.chunks.filter(chunk =>
  chunk.text.toLowerCase().includes('machine learning') ||
  chunk.text.toLowerCase().includes('neural network')
);

msg.filteredChunks = relevantChunks;
msg.filteredCount = relevantChunks.length;

Example 6: Chunk Size Analysis

// After Chunk Text

const chunkStats = {
  totalChunks: msg.chunks.length,
  avgLength: Math.round(
    msg.chunks.reduce((sum, c) => sum + c.text.length, 0) / msg.chunks.length
  ),
  minLength: Math.min(...msg.chunks.map(c => c.text.length)),
  maxLength: Math.max(...msg.chunks.map(c => c.text.length)),
  chunksWithMetadata: msg.chunks.filter(c => c.metadata).length
};

msg.chunkStats = chunkStats;

Tips for Effective Use

  1. Choosing Chunk Size:

    • Small chunks (300-500 chars): Better precision, more chunks, higher embedding costs
    • Medium chunks (800-1200 chars): Balanced approach, recommended for most cases
    • Large chunks (1500-2000 chars): More context, fewer chunks, may reduce precision
  2. Overlap Best Practices:

    • Use 10-20% of chunk size as overlap (e.g., 200 chars for 1000 char chunks)
    • Higher overlap = more context continuity but more redundancy
    • Lower overlap = less redundancy but potential context loss at boundaries
  3. Strategy Selection:

    • Use Basic for articles, blog posts, or unstructured content
    • Use By Title for documentation, books, reports with clear sections
    • By Title preserves topical coherence which improves retrieval accuracy
  4. Optimize for Your Embedding Model:

    • Check your embedding model's token limit
    • Account for ~4 characters per token (rough estimate)
    • Leave buffer room: if the limit is 512 tokens, target 400-450 tokens (roughly 1600-1800 characters); see the sketch after this list
  5. Combine Under N Chars (By Title):

    • Set to ~50% of max characters
    • Prevents orphan chunks from single short paragraphs
    • Maintains meaningful chunk sizes
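
A small helper for the token estimate in tip 4 (the 4-characters-per-token ratio is a rough heuristic, not an exact tokenizer):

// Rough token estimate; use your model's tokenizer for exact counts
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Pick a Max Characters value that leaves buffer under a token limit
function maxCharsForTokenLimit(tokenLimit, bufferRatio = 0.85) {
  return Math.floor(tokenLimit * bufferRatio * 4);
}

console.log(maxCharsForTokenLimit(512)); // 1740 characters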

Common Errors and Solutions

Error: "Text or elements input is required"

Cause: No input was provided to the node.
Solution: Connect either Extract Text output or Read Document elements output to this node.

Error: "Max Characters must be a positive number"

Cause: Invalid or zero max characters value.
Solution: Set Max Characters to a positive number (recommended: 800-1200).

Error: "Overlap cannot be negative"

Cause: Negative overlap value was provided.
Solution: Set Overlap to 0 or a positive number less than Max Characters.

Error: "Input must be a text string or an array"

Cause: Input is neither a string nor an array.
Solution: Ensure input comes from Extract Text (string) or Read Document (array).

Warning: Too Many Chunks Created

Cause: Max Characters is very small or the document is very large.

Solution:

  • Increase Max Characters to reduce chunk count
  • Consider processing document in sections
  • Check if this will impact embedding API costs
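
For example, you can guard against runaway chunk counts before calling an embedding API (the threshold here is an arbitrary example):

// After Chunk Text: warn before sending a very large batch downstream
const MAX_EXPECTED_CHUNKS = 500; // arbitrary threshold, tune for your budget
if (msg.chunkCount > MAX_EXPECTED_CHUNKS) {
  console.warn(
    `Chunk count ${msg.chunkCount} exceeds ${MAX_EXPECTED_CHUNKS}; ` +
    'consider increasing Max Characters or splitting the document.'
  );
}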

Issue: Chunks Too Small or Too Large

Cause: Suboptimal max characters or combine settings.

Solution:

  • Adjust Max Characters based on actual chunk lengths
  • For By Title strategy, adjust Combine Under N Chars
  • Review chunk size distribution with analysis code (Example 6)

Integration with Other Nodes

Complete RAG Ingestion Pipeline

Read Document → Chunk Text → Prepare for Embeddings → OpenAI Generate Batch Embeddings → Function (merge vectors) → LanceDB Add Data

Simple Text to Vector Database

Extract Text → Chunk Text (Basic) → Prepare for Embeddings → Embeddings → Database

Multi-Document Processing

Loop (files) → Read Document → Chunk Text → Collect Chunks → Batch Process → Store

Quality Control Pipeline

Read Document → Chunk Text → Function (validate chunks) → Filter → Process
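
A possible body for the validate-chunks Function step (illustrative; the minimum length is an arbitrary example):

// Drop empty or near-empty chunks before further processing
const MIN_CHUNK_LENGTH = 50; // arbitrary example threshold
msg.chunks = msg.chunks.filter(chunk =>
  chunk.text && chunk.text.trim().length >= MIN_CHUNK_LENGTH
);
msg.chunkCount = msg.chunks.length;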

Performance Considerations

  • Speed: Basic strategy is faster than By Title strategy
  • Memory: Chunks are stored in memory; very large documents may need batching
  • Chunk Count: More chunks = more embedding API calls = higher cost
  • Quality vs Cost: Larger chunks = fewer embeddings but may reduce search precision

Chunking Strategy Comparison

Aspect         | Basic (Size-based)  | By Title (Semantic)
---------------|---------------------|------------------------------
Speed          | Faster              | Slower (more processing)
Context        | May break mid-topic | Preserves topical boundaries
Best For       | Unstructured text   | Structured documents
Chunk Quality  | Variable            | More consistent
Complexity     | Simple              | Intelligent
Input Required | String or elements  | Elements (from Read Document)

Best Practices for RAG

  1. Document Structure Matters:

    • Use By Title strategy for structured documents (better retrieval)
    • Use Basic strategy for unstructured content
  2. Overlap is Critical:

    • Always use some overlap (150-200 chars) to maintain context
    • Questions spanning chunk boundaries are better answered with overlap
  3. Chunk Size Optimization:

    • Start with 1000 characters and adjust based on retrieval quality
    • Monitor chunk size distribution
    • Consider your typical query length when setting chunk size
  4. Metadata Preservation:

    • Chunks preserve page numbers when available
    • Use metadata to cite sources in responses (see the sketch after this list)
    • Track elementCount to understand chunk composition
  5. Testing and Iteration:

    • Test retrieval quality with different chunk sizes
    • Analyze failed retrievals to optimize chunking parameters
    • Balance cost (chunk count) vs quality (chunk size)
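
For instance, chunk metadata can drive simple source citations (a sketch using the metadata fields documented above; msg.documentName is a hypothetical field):

// Build a citation string for each chunk
msg.citations = msg.chunks.map(chunk => {
  const page = chunk.metadata?.page;
  return page != null
    ? `${msg.documentName}, page ${page}`
    : msg.documentName;
});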

Advanced Use Cases

Dynamic Chunk Size Based on Content Type

// Before Chunk Text, in a Function node
const contentType = msg.documentType;

if (contentType === 'code-documentation') {
  msg.maxChars = 1500; // Larger for code examples
  msg.overlap = 300;
} else if (contentType === 'news-article') {
  msg.maxChars = 600; // Smaller for quick facts
  msg.overlap = 100;
} else {
  msg.maxChars = 1000; // Default
  msg.overlap = 200;
}

Hierarchical Chunking

// After Chunk Text
// Create parent-child relationships

msg.chunks.forEach((chunk, index) => {
  chunk.parentDoc = msg.sourceFilePath;
  chunk.chunkId = `${msg.documentId}_chunk_${index}`;

  // Link to previous/next chunks
  if (index > 0) chunk.previousChunk = `${msg.documentId}_chunk_${index - 1}`;
  if (index < msg.chunks.length - 1) chunk.nextChunk = `${msg.documentId}_chunk_${index + 1}`;
});
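
Linking neighboring chunks this way lets a retrieval step fetch the previous and next chunks of a match to expand context before generation.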