Prepare for Embeddings
Format document chunks for direct use with embedding APIs and vector databases. This node bridges the gap between chunked documents and vector storage, preparing data in the exact format needed for OpenAI embeddings and LanceDB.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- Chunks - Array of chunks from the Chunk Text node.
- Source File - (Optional) Source filename or path to include in metadata. The node automatically extracts just the filename if a full path is provided.
Options
- ID Prefix - Prefix for generated unique IDs. Default is "doc". Each record gets an ID like doc_0, doc_1, etc.
- Include Metadata - Include source file and page information in records. Default is true. Metadata helps track the origin of each chunk in your vector database.
Output
- Texts - Simple array of text strings, ready to pass directly to OpenAI Generate Batch Embeddings node. Format: ["text1", "text2", "text3"]
- Records - Array of record objects ready for LanceDB (after adding vectors). Each record includes:
  - id - Unique identifier (e.g., "doc_0", "doc_1")
  - text - The chunk text content
  - chunk_index - Position in the original chunk array (0-based)
  - source - (if metadata enabled) Source filename
  - page - (if metadata enabled and available) Page number
  - element_count - (if metadata enabled and available) Number of elements in chunk
- Count - Number of items prepared for embedding.
How It Works
The Prepare for Embeddings node transforms chunks into the specific formats required by embedding APIs and vector databases. When executed, the node:
- Validates the input chunks array
- Extracts text from each chunk (handles both string and object formats)
- Generates unique IDs using the specified prefix
- Creates a simple texts array for the embedding API
- Creates structured records with IDs and metadata for the database
- Preserves chunk metadata (page numbers, element counts) when available
- Returns both formats ready for the next steps in your pipeline
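Conceptually, the transformation is equivalent to the following Function node sketch. This is a simplified illustration, not the node's actual implementation; it assumes chunks are plain strings or objects with text and optional metadata, as produced by Chunk Text.
// Simplified illustration of what Prepare for Embeddings does (not the actual implementation)
const prefix = "doc";                                        // ID Prefix option
const includeMetadata = true;                                // Include Metadata option
const source = (msg.sourceFile || "").split(/[\\/]/).pop();  // keep only the filename

msg.texts = [];
msg.records = [];
msg.chunks.forEach((chunk, i) => {
  const isObject = chunk && typeof chunk === "object";
  const text = isObject ? chunk.text : chunk;
  if (!text) return;                                         // empty chunks are skipped
  const meta = (isObject && chunk.metadata) || {};
  const record = { id: `${prefix}_${i}`, text: text, chunk_index: i };
  if (includeMetadata) {
    if (source) record.source = source;
    if (meta.page !== undefined) record.page = meta.page;
    if (meta.element_count !== undefined) record.element_count = meta.element_count;
  }
  msg.texts.push(text);
  msg.records.push(record);
});
msg.count = msg.records.length;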
Requirements
- Valid chunks array from Chunk Text node
- Chunks must contain text content
Error Handling
The node will return specific errors in the following cases:
- Empty or missing chunks input
- Chunks input is not an array
- Chunks contain no text content
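If you want these failures surfaced earlier (for example, when processing user uploads), a small guard in a Function node before this node can check the same conditions. This is an optional illustrative sketch, not part of the node itself.
// Optional pre-check in a Function node BEFORE Prepare for Embeddings
if (!msg.chunks) {
  throw new Error("Chunks input is required");
}
if (!Array.isArray(msg.chunks)) {
  throw new Error("Chunks must be an array");
}
const hasText = msg.chunks.some(c => (typeof c === "string" ? c : c && c.text));
if (!hasText) {
  throw new Error("Chunks contain no text content");
}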
Usage Examples
Example 1: Complete RAG Pipeline
// Chunk Text output
msg.chunks = [
{ text: "Introduction to AI...", index: 0, metadata: { page: 1 } },
{ text: "Machine learning basics...", index: 1, metadata: { page: 2 } }
]
msg.sourceFile = "/documents/ai-guide.pdf"
// Prepare for Embeddings output
msg.texts = [
"Introduction to AI...",
"Machine learning basics..."
]
msg.records = [
{
id: "doc_0",
text: "Introduction to AI...",
chunk_index: 0,
source: "ai-guide.pdf",
page: 1
},
{
id: "doc_1",
text: "Machine learning basics...",
chunk_index: 1,
source: "ai-guide.pdf",
page: 2
}
]
msg.count = 2
// Next: Connect msg.texts to OpenAI Generate Batch Embeddings
// Then: Merge vectors with records and insert to LanceDB
Example 2: Merge Embeddings with Records
// Function node AFTER OpenAI Generate Batch Embeddings
// This merges the vectors with the records
const records = msg.records;
const embeddings = msg.embeddings; // from OpenAI node
// Combine vectors with record metadata
msg.data = records.map((record, i) => ({
...record,
vector: embeddings[i]
}));
// msg.data now ready for LanceDB Add Data node
// Format:
// [
// {
// id: "doc_0",
// text: "Introduction to AI...",
// chunk_index: 0,
// source: "ai-guide.pdf",
// page: 1,
// vector: [0.123, -0.456, 0.789, ...]
// },
// ...
// ]
Example 3: Custom ID Prefix for Multi-Document Processing
// Process multiple documents with unique IDs
// Document 1
msg.sourceFile = "/docs/report-2024.pdf"
// Set ID Prefix: "report2024"
// Output: { records: [{ id: "report2024_0", ... }, { id: "report2024_1", ... }] }
// Document 2
msg.sourceFile = "/docs/manual.pdf"
// Set ID Prefix: "manual"
// Output: { records: [{ id: "manual_0", ... }, { id: "manual_1", ... }] }
// All records have unique IDs across documents
Example 4: Batch Processing Multiple Documents
// Loop over multiple documents
const documents = [
"/docs/doc1.pdf",
"/docs/doc2.pdf",
"/docs/doc3.pdf"
];
// After Prepare for Embeddings in loop
if (!msg.allTexts) {
msg.allTexts = [];
msg.allRecords = [];
}
msg.allTexts.push(...msg.texts);
msg.allRecords.push(...msg.records);
// After loop completes, process all at once
// msg.allTexts → OpenAI Generate Batch Embeddings
// Merge with msg.allRecords → LanceDB
Example 5: Add Custom Metadata
// Function node AFTER Prepare for Embeddings
// Add additional metadata before database insertion
msg.records = msg.records.map(record => ({
...record,
documentType: msg.documentType,
uploadedBy: msg.userId,
uploadedAt: new Date().toISOString(),
language: "en",
category: msg.category
}));
// Enhanced records ready for database
Example 6: Filter and Prepare Specific Chunks
// Before Prepare for Embeddings
// Filter chunks based on criteria
msg.chunks = msg.chunks.filter(chunk => {
  // Only include chunks with sufficient length (chunks may be strings or objects)
  const text = typeof chunk === "string" ? chunk : chunk.text;
  return text && text.length > 100;
});
// Then connect to Prepare for Embeddings
// Only filtered chunks will be embedded
Tips for Effective Use
- ID Prefix Strategy:
  - Use document names or IDs as prefixes for easy tracking
  - For multi-document processing, ensure unique prefixes
  - Use descriptive prefixes: "manual", "report2024", "kb-article"
- Source File Best Practices:
  - Always provide source file for traceability
  - The node automatically extracts filename from full paths
  - Use relative or absolute paths - both work
- Metadata Usage:
  - Keep metadata enabled for production systems (helps with debugging)
  - Page numbers are crucial for citation and source attribution
  - Element count helps understand chunk composition
- Output Format Understanding:
  - The texts array is for embedding generation only
  - The records array is your database schema (minus vectors)
  - Keep both outputs - you need them at different pipeline stages
- Integration Pattern:
  Prepare for Embeddings
  ├─ texts → OpenAI Generate Batch Embeddings → embeddings
  └─ records → Function (merge with embeddings) → LanceDB
Common Errors and Solutions
Error: "Chunks input is required"
Cause: No chunks array was provided. Solution: Connect the Chunks output from Chunk Text node to this node's input.
Error: "Chunks must be an array"
Cause: Input is not an array. Solution: Ensure you're connecting the chunks output (array) from Chunk Text node, not individual chunk objects.
Warning: Empty chunks skipped
Cause: Some chunks in the array have empty text. Solution: This is normal - the node automatically skips empty chunks. Check your chunking parameters if you're getting too many empty chunks.
Issue: Count is Lower Than Expected
Cause: Some chunks had no text content and were skipped. Solution:
- Review Chunk Text output to ensure chunks have content
- Check for filtering that might have removed text
- This is usually not an error - empty chunks are safely ignored
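To make skipped chunks visible, you can compare the incoming chunk count with the Count output in a Function node right after this node. Illustrative sketch; it assumes msg.chunks is still present on the message and uses console.log for logging.
// Function node AFTER Prepare for Embeddings
const expected = Array.isArray(msg.chunks) ? msg.chunks.length : 0;
if (msg.count < expected) {
  // Not an error - empty chunks are skipped - but worth logging for traceability
  console.log(`Prepare for Embeddings skipped ${expected - msg.count} empty chunk(s)`);
}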
Issue: Missing page numbers in records
Cause: Source document format doesn't provide page numbers, or metadata was not preserved during chunking. Solution:
- This is normal for formats like TXT, MD, HTML
- Ensure Read Document → Chunk Text preserves metadata
- Page numbers are optional - records are still valid without them
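If your vector store expects every record to share the same fields, you can normalize the optional metadata before insertion. Illustrative sketch; the null defaults are an assumption, adjust them to your schema.
// Function node AFTER Prepare for Embeddings
// Give every record the same shape, even when optional metadata is missing
msg.records = msg.records.map(record => ({
  ...record,
  page: record.page !== undefined ? record.page : null,
  element_count: record.element_count !== undefined ? record.element_count : null
}));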
Integration with Other Nodes
Standard RAG Pipeline
Read Document → Chunk Text → Prepare for Embeddings
  ├─ texts → OpenAI Generate Batch Embeddings → embeddings
  └─ records
        ↓
Function (merge)
        ↓
LanceDB Add Data
Multi-Document Processing
Loop (files) → Read Document → Chunk Text → Prepare for Embeddings
↓
Collect All Texts/Records
↓
Batch Embed → Store
Selective Embedding
Chunk Text → Function (filter chunks) → Prepare for Embeddings → Embed → Store
Update Existing Knowledge Base
Prepare for Embeddings → Check Existing IDs → New Records → Embed → Insert
                               └─ Changed Records → Update
Performance Considerations
- Speed: Very fast - simple data transformation
- Memory: Minimal - creates new arrays but no heavy processing
- Batch Size: Can handle thousands of chunks
- API Costs: Every chunk becomes one embedding input, so API cost grows with the number and size of chunks
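For a rough sense of embedding volume before calling the API, you can count chunks and approximate tokens in a Function node. Illustrative sketch; the 4-characters-per-token ratio is only a heuristic, not an exact tokenizer.
// Function node AFTER Prepare for Embeddings
const totalChars = msg.texts.reduce((sum, text) => sum + text.length, 0);
msg.embeddingStats = {
  chunkCount: msg.count,
  approxTokens: Math.ceil(totalChars / 4)   // rough heuristic only
};
console.log(`Embedding ${msg.count} chunks (~${msg.embeddingStats.approxTokens} tokens)`);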
Output Format Details
Texts Array Format
msg.texts = [
"First chunk text content here...",
"Second chunk text content here...",
"Third chunk text content here..."
]
This array goes directly to OpenAI Generate Batch Embeddings node.
Records Array Format
msg.records = [
{
id: "doc_0", // Unique ID
text: "First chunk...", // Text content
chunk_index: 0, // Position in original chunks
source: "document.pdf", // Source filename (if provided)
page: 1, // Page number (if available)
element_count: 3 // Elements in chunk (if available)
},
{
id: "doc_1",
text: "Second chunk...",
chunk_index: 1,
source: "document.pdf",
page: 1
}
]
After embedding, merge vectors to create final database records.
Best Practices for Production
- Always Track Source:
  - Provide source file for every document
  - Use full paths during processing, node extracts filename
  - Essential for debugging and user citations
- Unique ID Strategy:
  - Use document-specific prefixes
  - Consider an ID scheme like {collection}_{document}_{chunk}
  - Prevents ID collisions in multi-document scenarios
- Metadata Management:
  - Keep metadata enabled unless you have specific reasons not to
  - Add custom metadata in a Function node after this node
  - Standard metadata (page, source) should always be preserved
- Error Handling:
  - Use Try-Catch nodes when processing user uploads
  - Validate that the count output matches the expected number of chunks (see the validation sketch after this list)
  - Log any discrepancies between chunk count and output count
- Testing Pipeline:
  - Test with a single document first
  - Verify texts and records arrays match
  - Confirm IDs are unique across your dataset
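The count and uniqueness checks from the list above can be automated in a Function node right after this node. Illustrative sketch.
// Function node AFTER Prepare for Embeddings
// Validate that outputs are in sync and IDs are unique before embedding/insertion
if (msg.texts.length !== msg.records.length) {
  throw new Error("texts and records are out of sync");
}
const uniqueIds = new Set(msg.records.map(record => record.id));
if (uniqueIds.size !== msg.records.length) {
  throw new Error("Duplicate record IDs detected - review your ID Prefix strategy");
}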
Comparison with Manual Formatting
| Approach | Prepare for Embeddings | Manual Function Node |
|---|---|---|
| Ease of Use | Simple, no code needed | Requires JavaScript |
| Consistency | Guaranteed format | Manual validation needed |
| Metadata | Auto-preserves from chunks | Manual extraction |
| ID Generation | Automatic with prefix | Manual implementation |
| Speed | Optimized | Depends on code |
| Best For | Standard RAG pipelines | Custom requirements |
Advanced Use Cases
Dynamic ID Generation Based on Content
// Function node AFTER Prepare for Embeddings
// Replace auto-generated IDs with content-based IDs
// Note: require('crypto') assumes a Node.js-based Function node
const crypto = require('crypto');
msg.records = msg.records.map(record => {
  // Derive a short, stable ID from a hash of the chunk text
  const contentHash = crypto.createHash('sha256').update(record.text).digest('hex').slice(0, 12);
  return {
    ...record,
    id: `${msg.documentType}_${contentHash}`,
    originalId: record.id // Keep original for reference
  };
});
Incremental Updates
// Check which chunks already exist in database
// Only prepare new/changed chunks for embedding
const existingIds = msg.existingChunkIds; // from database query
const newChunks = msg.chunks.filter((chunk, i) => {
const potentialId = `${msg.idPrefix}_${i}`;
return !existingIds.includes(potentialId);
});
msg.chunks = newChunks;
// Then connect to Prepare for Embeddings
// Note: the node generates IDs starting from 0 for the filtered array, which may collide
// with IDs already stored - consider rewriting IDs afterwards (e.g., content-based IDs as above)
Multi-Language Document Processing
// After Prepare for Embeddings
// Add language detection
// Note: detectLanguage and translateIfNeeded are placeholders for your own helpers
// (e.g., a language-detection library or a translation API call)
msg.records = await Promise.all(
  msg.records.map(async record => ({
    ...record,
    language: await detectLanguage(record.text),
    translatedText: await translateIfNeeded(record.text)
  }))
);