Prepare for Embeddings
Format document chunks for direct use with embedding APIs and vector databases. This node bridges the gap between chunked documents and vector storage, preparing data in the exact format needed for OpenAI embeddings and LanceDB.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- Chunks - Array of chunks from the Chunk Text node.
- Source File - (Optional) Source filename or path to include in metadata. The node automatically extracts just the filename if a full path is provided.
Options
- ID Prefix - Prefix for generated unique IDs. Default is "doc". Each record gets an ID like doc_0, doc_1, etc.
- Include Metadata - Include source file and page information in records. Default is true. Metadata helps track the origin of each chunk in your vector database.
Output
- Texts - Simple array of text strings, ready to pass directly to OpenAI Generate Batch Embeddings node. Format: ["text1", "text2", "text3"]
- Records - Array of record objects ready for LanceDB (after adding vectors). Each record includes:
  - id - Unique identifier (e.g., "doc_0", "doc_1")
  - text - The chunk text content
  - chunk_index - Position in the original chunk array (0-based)
  - source - (if metadata enabled) Source filename
  - page - (if metadata enabled and available) Page number
  - element_count - (if metadata enabled and available) Number of elements in chunk
- Count - Number of items prepared for embedding.
How It Works
The Prepare for Embeddings node transforms chunks into the specific formats required by embedding APIs and vector databases. When executed, the node:
- Validates the input chunks array
- Extracts text from each chunk (handles both string and object formats)
- Generates unique IDs using the specified prefix
- Creates a simple texts array for the embedding API
- Creates structured records with IDs and metadata for the database
- Preserves chunk metadata (page numbers, element counts) when available
- Returns both formats ready for the next steps in your pipeline
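Conceptually, the transformation is equivalent to the following Function node sketch. This is a simplified illustration, not the node's actual implementation; it assumes chunks are plain strings or objects with text and optional metadata, as produced by Chunk Text.
// Simplified illustration of what Prepare for Embeddings does (not the actual implementation)
const prefix = "doc";                                        // ID Prefix option
const includeMetadata = true;                                // Include Metadata option
const source = (msg.sourceFile || "").split(/[\\/]/).pop();  // keep only the filename

msg.texts = [];
msg.records = [];
msg.chunks.forEach((chunk, i) => {
  const isObject = chunk && typeof chunk === "object";
  const text = isObject ? chunk.text : chunk;
  if (!text) return;                                         // empty chunks are skipped
  const meta = (isObject && chunk.metadata) || {};
  const record = { id: `${prefix}_${i}`, text: text, chunk_index: i };
  if (includeMetadata) {
    if (source) record.source = source;
    if (meta.page !== undefined) record.page = meta.page;
    if (meta.element_count !== undefined) record.element_count = meta.element_count;
  }
  msg.texts.push(text);
  msg.records.push(record);
});
msg.count = msg.records.length;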
Requirements
- Valid chunks array from Chunk Text node
- Chunks must contain text content
Error Handling
The node will return specific errors in the following cases:
- Empty or missing chunks input
- Chunks input is not an array
- Chunks contain no text content
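If you want these failures surfaced earlier (for example, when processing user uploads), a small guard in a Function node before this node can check the same conditions. This is an optional illustrative sketch, not part of the node itself.
// Optional pre-check in a Function node BEFORE Prepare for Embeddings
if (!msg.chunks) {
  throw new Error("Chunks input is required");
}
if (!Array.isArray(msg.chunks)) {
  throw new Error("Chunks must be an array");
}
const hasText = msg.chunks.some(c => (typeof c === "string" ? c : c && c.text));
if (!hasText) {
  throw new Error("Chunks contain no text content");
}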
Usage Examples
Example 1: Complete RAG Pipeline
// Chunk Text output
msg.chunks = [
{ text: "Introduction to AI...", index: 0, metadata: { page: 1 } },
{ text: "Machine learning basics...", index: 1, metadata: { page: 2 } }
]
msg.sourceFile = "/documents/ai-guide.pdf"
// Prepare for Embeddings output
msg.texts = [
"Introduction to AI...",
"Machine learning basics..."
]
msg.records = [
{
id: "doc_0",
text: "Introduction to AI...",
chunk_index: 0,
source: "ai-guide.pdf",
page: 1
},
{
id: "doc_1",
text: "Machine learning basics...",
chunk_index: 1,
source: "ai-guide.pdf",
page: 2
}
]
msg.count = 2
// Next: Connect msg.texts to OpenAI Generate Batch Embeddings
// Then: Merge vectors with records and insert to LanceDB
Example 2: Merge Embeddings with Records
// Function node AFTER OpenAI Generate Batch Embeddings
// This merges the vectors with the records
const records = msg.records;
const embeddings = msg.embeddings; // from OpenAI node
// Combine vectors with record metadata
msg.data = records.map((record, i) => ({
...record,
vector: embeddings[i]
}));
// msg.data now ready for LanceDB Add Data node
// Format:
// [
// {
// id: "doc_0",
// text: "Introduction to AI...",
// chunk_index: 0,
// source: "ai-guide.pdf",
// page: 1,
// vector: [0.123, -0.456, 0.789, ...]
// },
// ...
// ]
Example 3: Custom ID Prefix for Multi-Document Processing
// Process multiple documents with unique IDs
// Document 1
msg.sourceFile = "/docs/report-2024.pdf"
// Set ID Prefix: "report2024"
// Output: { records: [{ id: "report2024_0", ... }, { id: "report2024_1", ... }] }
// Document 2
msg.sourceFile = "/docs/manual.pdf"
// Set ID Prefix: "manual"
// Output: { records: [{ id: "manual_0", ... }, { id: "manual_1", ... }] }
// All records have unique IDs across documents
Example 4: Batch Processing Multiple Documents
// Loop over multiple documents
const documents = [
"/docs/doc1.pdf",
"/docs/doc2.pdf",
"/docs/doc3.pdf"
];
// After Prepare for Embeddings in loop
if (!msg.allTexts) {
msg.allTexts = [];
msg.allRecords = [];
}
msg.allTexts.push(...msg.texts);
msg.allRecords.push(...msg.records);
// After loop completes, process all at once
// msg.allTexts → OpenAI Generate Batch Embeddings
// Merge with msg.allRecords → LanceDB
Example 5: Add Custom Metadata
// Function node AFTER Prepare for Embeddings
// Add additional metadata before database insertion
msg.records = msg.records.map(record => ({
...record,
documentType: msg.documentType,
uploadedBy: msg.userId,
uploadedAt: new Date().toISOString(),
language: "en",
category: msg.category
}));
// Enhanced records ready for database
Example 6: Filter and Prepare Specific Chunks
// Before Prepare for Embeddings
// Filter chunks based on criteria
msg.chunks = msg.chunks.filter(chunk => {
  // Only include chunks with sufficient length (chunks may be strings or objects)
  const text = typeof chunk === "string" ? chunk : chunk.text;
  return text && text.length > 100;
});
// Then connect to Prepare for Embeddings
// Only filtered chunks will be embedded
Tips for Effective Use
- ID Prefix Strategy:
  - Use document names or IDs as prefixes for easy tracking
  - For multi-document processing, ensure unique prefixes
  - Use descriptive prefixes: "manual", "report2024", "kb-article"
- Source File Best Practices:
  - Always provide source file for traceability
  - The node automatically extracts filename from full paths
  - Use relative or absolute paths - both work
- Metadata Usage:
  - Keep metadata enabled for production systems (helps with debugging)
  - Page numbers are crucial for citation and source attribution
  - Element count helps understand chunk composition
- Output Format Understanding:
  - The texts array is for embedding generation only
  - The records array is your database schema (minus vectors)
  - Keep both outputs - you need them at different pipeline stages
- Integration Pattern:
  Prepare for Embeddings
  ├─ texts → OpenAI Generate Batch Embeddings → embeddings
  └─ records → Function (merge with embeddings) → LanceDB
Common Errors and Solutions
Error: "Chunks input is required"
Cause: No chunks array was provided. Solution: Connect the Chunks output from Chunk Text node to this node's input.
Error: "Chunks must be an array"
Cause: Input is not an array. Solution: Ensure you're connecting the chunks output (array) from Chunk Text node, not individual chunk objects.
Warning: Empty chunks skipped
Cause: Some chunks in the array have empty text. Solution: This is normal - the node automatically skips empty chunks. Check your chunking parameters if you're getting too many empty chunks.
Issue: Count is Lower Than Expected
Cause: Some chunks had no text content and were skipped. Solution:
- Review Chunk Text output to ensure chunks have content
- Check for filtering that might have removed text
- This is usually not an error - empty chunks are safely ignored
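To make skipped chunks visible, you can compare the incoming chunk count with the Count output in a Function node right after this node. Illustrative sketch; it assumes msg.chunks is still present on the message and uses console.log for logging.
// Function node AFTER Prepare for Embeddings
const expected = Array.isArray(msg.chunks) ? msg.chunks.length : 0;
if (msg.count < expected) {
  // Not an error - empty chunks are skipped - but worth logging for traceability
  console.log(`Prepare for Embeddings skipped ${expected - msg.count} empty chunk(s)`);
}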
Issue: Missing page numbers in records
Cause: Source document format doesn't provide page numbers, or metadata was not preserved during chunking. Solution:
- This is normal for formats like TXT, MD, HTML
- Ensure Read Document → Chunk Text preserves metadata
- Page numbers are optional - records are still valid without them
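If your vector store expects every record to share the same fields, you can normalize the optional metadata before insertion. Illustrative sketch; the null defaults are an assumption, adjust them to your schema.
// Function node AFTER Prepare for Embeddings
// Give every record the same shape, even when optional metadata is missing
msg.records = msg.records.map(record => ({
  ...record,
  page: record.page !== undefined ? record.page : null,
  element_count: record.element_count !== undefined ? record.element_count : null
}));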
Integration with Other Nodes
Standard RAG Pipeline
Read Document → Chunk Text → Prepare for Embeddings
  ├─ texts → OpenAI Generate Batch Embeddings → embeddings
  └─ records
        ↓
Function (merge)
        ↓
LanceDB Add Data
Multi-Document Processing
Loop (files) → Read Document → Chunk Text → Prepare for Embeddings
↓
Collect All Texts/Records
↓
Batch Embed → Store
Selective Embedding
Chunk Text → Function (filter chunks) → Prepare for Embeddings → Embed → Store
Update Existing Knowledge Base
Prepare for Embeddings → Check Existing IDs → New Records → Embed → Insert
                               └─ Changed Records → Update
Performance Considerations
- Speed: Very fast - simple data transformation
- Memory: Minimal - creates new arrays but no heavy processing
- Batch Size: Can handle thousands of chunks
- API Costs: Every chunk becomes one embedding input, so API cost grows with the number and size of chunks
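For a rough sense of embedding volume before calling the API, you can count chunks and approximate tokens in a Function node. Illustrative sketch; the 4-characters-per-token ratio is only a heuristic, not an exact tokenizer.
// Function node AFTER Prepare for Embeddings
const totalChars = msg.texts.reduce((sum, text) => sum + text.length, 0);
msg.embeddingStats = {
  chunkCount: msg.count,
  approxTokens: Math.ceil(totalChars / 4)   // rough heuristic only
};
console.log(`Embedding ${msg.count} chunks (~${msg.embeddingStats.approxTokens} tokens)`);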
Output Format Details
Texts Array Format
msg.texts = [
"First chunk text content here...",
"Second chunk text content here...",
"Third chunk text content here..."
]
This array goes directly to OpenAI Generate Batch Embeddings node.
Records Array Format
msg.records = [
{
id: "doc_0", // Unique ID
text: "First chunk...", // Text content
chunk_index: 0, // Position in original chunks
source: "document.pdf", // Source filename (if provided)
page: 1, // Page number (if available)
element_count: 3 // Elements in chunk (if available)
},
{
id: "doc_1",
text: "Second chunk...",
chunk_index: 1,
source: "document.pdf",
page: 1
}
]
After embedding, merge vectors to create final database records.
Best Practices for Production
- Always Track Source:
  - Provide source file for every document
  - Use full paths during processing, node extracts filename
  - Essential for debugging and user citations
- Unique ID Strategy:
  - Use document-specific prefixes
  - Consider an ID scheme like {collection}_{document}_{chunk}
  - Prevents ID collisions in multi-document scenarios
- Metadata Management:
  - Keep metadata enabled unless you have specific reasons not to
  - Add custom metadata in a Function node after this node
  - Standard metadata (page, source) should always be preserved
- Error Handling:
  - Use Try-Catch nodes when processing user uploads
  - Validate that the count output matches the expected number of chunks (see the validation sketch after this list)
  - Log any discrepancies between chunk count and output count
- Testing Pipeline:
  - Test with a single document first
  - Verify texts and records arrays match
  - Confirm IDs are unique across your dataset
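The count and uniqueness checks from the list above can be automated in a Function node right after this node. Illustrative sketch.
// Function node AFTER Prepare for Embeddings
// Validate that outputs are in sync and IDs are unique before embedding/insertion
if (msg.texts.length !== msg.records.length) {
  throw new Error("texts and records are out of sync");
}
const uniqueIds = new Set(msg.records.map(record => record.id));
if (uniqueIds.size !== msg.records.length) {
  throw new Error("Duplicate record IDs detected - review your ID Prefix strategy");
}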
Comparison with Manual Formatting
| Approach | Prepare for Embeddings | Manual Function Node |
|---|---|---|
| Ease of Use | Simple, no code needed | Requires JavaScript |
| Consistency | Guaranteed format | Manual validation needed |
| Metadata | Auto-preserves from chunks | Manual extraction |
| ID Generation | Automatic with prefix | Manual implementation |
| Speed | Optimized | Depends on code |
| Best For | Standard RAG pipelines | Custom requirements |
Advanced Use Cases
Dynamic ID Generation Based on Content
// Function node AFTER Prepare for Embeddings
// Replace auto-generated IDs with content-based IDs
// Note: require('crypto') assumes a Node.js-based Function node
const crypto = require('crypto');
msg.records = msg.records.map(record => {
  // Derive a short, stable ID from a hash of the chunk text
  const contentHash = crypto.createHash('sha256').update(record.text).digest('hex').slice(0, 12);
  return {
    ...record,
    id: `${msg.documentType}_${contentHash}`,
    originalId: record.id // Keep original for reference
  };
});
Incremental Updates
// Check which chunks already exist in database
// Only prepare new/changed chunks for embedding
const existingIds = msg.existingChunkIds; // from database query
const newChunks = msg.chunks.filter((chunk, i) => {
const potentialId = `${msg.idPrefix}_${i}`;
return !existingIds.includes(potentialId);
});
msg.chunks = newChunks;
// Then connect to Prepare for Embeddings
// Note: the node generates IDs starting from 0 for the filtered array, which may collide
// with IDs already stored - consider rewriting IDs afterwards (e.g., content-based IDs as above)
Multi-Language Document Processing
// After Prepare for Embeddings
// Add language detection
// Note: detectLanguage and translateIfNeeded are placeholders for your own helpers
// (e.g., a language-detection library or a translation API call)
msg.records = await Promise.all(
  msg.records.map(async record => ({
    ...record,
    language: await detectLanguage(record.text),
    translatedText: await translateIfNeeded(record.text)
  }))
);