Read Document
Parse any document (PDF, DOCX, PPTX, TXT, HTML, MD, etc.) into structured elements with text and metadata. This node provides detailed document structure analysis, making it ideal for understanding document layout and extracting specific sections.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If Continue On Error is set to true, errors are not caught when the project is executed, even if a Catch node is used.
Inputs
- File Path - Path to the document file to parse. Supports PDF, DOCX, PPTX, TXT, MD, HTML, EML, RTF, EPUB, ODT, CSV, TSV, JSON, XML, and more.
Options
- PDF Strategy - Strategy for PDF parsing. Default is "Auto (Recommended)".
- Auto (Recommended) - Automatically selects the best strategy for the PDF
- Fast (Text-based PDFs) - Optimized for PDFs with selectable text (faster processing)
- Hi-Res (Complex layouts) - Best for documents with complex layouts, tables, or mixed content
- OCR Only (Scanned docs) - Uses optical character recognition for image-only PDFs or scanned documents
- Include Metadata - Include element metadata such as page numbers, coordinates, and languages. Default is true.
Output
- Elements - Array of parsed elements, each containing:
- text - The extracted text content
- type - Element type (Title, NarrativeText, ListItem, Table, Image, Header, Footer)
- metadata - (if enabled) Page number, filename, coordinates, languages
- Element Count - Number of elements extracted from the document.
- Full Text - All text concatenated as a single string for convenience.
Element Types
The Read Document node categorizes content into the following element types:
- Title - Document titles and section headings
- NarrativeText - Paragraphs and body text
- ListItem - Bulleted or numbered list items
- Table - Table elements (use Extract Tables node for structured table data)
- Image - Image elements
- Header - Page headers
- Footer - Page footers
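To see how a document breaks down across these types, a Function node placed after Read Document can tally elements by type. This is a sketch only; the sample `msg` below mirrors the output shape documented above, not real node output:

```javascript
// Count parsed elements by type (sample msg shape is assumed for illustration).
const msg = {
  elements: [
    { text: "Report Title", type: "Title", metadata: { page: 1 } },
    { text: "Body paragraph...", type: "NarrativeText", metadata: { page: 1 } },
    { text: "Item one", type: "ListItem", metadata: { page: 2 } },
    { text: "Item two", type: "ListItem", metadata: { page: 2 } },
  ],
};

const countsByType = {};
for (const el of msg.elements) {
  countsByType[el.type] = (countsByType[el.type] || 0) + 1;
}
console.log(countsByType); // { Title: 1, NarrativeText: 1, ListItem: 2 }
```

A tally like this is a quick sanity check: a scanned PDF parsed with the wrong strategy often shows up as a single giant NarrativeText element or no elements at all.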
How It Works
The Read Document node uses the Unstructured library to intelligently parse documents. When executed, the node:
- Validates the provided file path
- Automatically detects the document format based on file extension
- Applies the appropriate parsing strategy (especially for PDFs)
- Extracts all elements with their text content and type
- Captures metadata for each element (page numbers, coordinates, etc.)
- Returns the structured elements array and the concatenated full text
Requirements
- Valid file path to a supported document format
- Read access to the file
- Sufficient memory for large documents
Error Handling
The node will return specific errors in the following cases:
- Empty or missing file path
- File not found at the specified path
- Unsupported document format
- Corrupted or invalid document file
- Insufficient memory for very large documents
Usage Examples
Example 1: Extract Document Structure for RAG
Process a PDF and prepare it for semantic search:
// Read Document output
msg.elements = [
{ text: "Introduction to Machine Learning", type: "Title", metadata: { page: 1 } },
{ text: "Machine learning is a subset of artificial intelligence...", type: "NarrativeText", metadata: { page: 1 } },
{ text: "What is supervised learning?", type: "Title", metadata: { page: 2 } },
// ... more elements
]
msg.elementCount = 45
msg.text = "Introduction to Machine Learning\n\nMachine learning is..."
Connect to Chunk Text node to split into embedding-ready chunks while preserving semantic boundaries.
Example 2: Filter Specific Element Types
Use a Function node to extract only titles:
// After Read Document
const titles = msg.elements.filter(el => el.type === 'Title');
msg.titles = titles.map(el => el.text);
// Output: ["Introduction", "Chapter 1", "Chapter 2", ...]
Example 3: Process Multi-Page PDF with Page Tracking
// Read Document output includes page numbers
msg.elements.forEach(element => {
console.log(`Page ${element.metadata.page}: ${element.text.substring(0, 50)}...`);
});
Tips for Effective Use
- Choose the Right PDF Strategy:
- Use Auto for general documents - it works well in most cases
- Use Fast when you know the PDF has selectable text and no images
- Use Hi-Res for scanned documents, complex layouts, or documents with tables
- Use OCR Only specifically for image-based PDFs without embedded text
- Memory Considerations:
- Very large documents (100+ pages) may require significant memory
- Consider processing large documents in batches or pages if memory is limited
- For RAG Workflows:
- Connect directly to Chunk Text node with "By Title" strategy to preserve document structure
- Keep metadata enabled to track source pages in your vector database
- Element Type Filtering:
- Use a Function node after Read Document to filter specific element types
- Filter out Headers/Footers if they contain repetitive navigation text
- Text Extraction vs Structured Parsing:
- Use Read Document when you need document structure and element types
- Use Extract Text when you only need plain text content (faster)
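The element-filtering tip can be sketched as a Function node that drops page furniture before chunking. The `msg` shape follows the Output section above and is an assumption, not real node output:

```javascript
// Remove Header/Footer elements so repetitive page furniture
// doesn't pollute downstream chunks (sample msg assumed).
const msg = {
  elements: [
    { text: "ACME Corp - Confidential", type: "Header", metadata: { page: 1 } },
    { text: "Quarterly results improved...", type: "NarrativeText", metadata: { page: 1 } },
    { text: "Page 1 of 10", type: "Footer", metadata: { page: 1 } },
  ],
};

const SKIP = new Set(["Header", "Footer"]);
msg.elements = msg.elements.filter((el) => !SKIP.has(el.type));

// Rebuild the convenience full-text field from the surviving elements.
msg.text = msg.elements.map((el) => el.text).join("\n\n");
```

Filtering before Chunk Text keeps boilerplate like "Page 1 of 10" out of your embeddings, where it would otherwise match many unrelated queries.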
Common Errors and Solutions
Error: "File not found"
Cause: The specified file path doesn't exist or is incorrect.
Solution: Verify the file path is correct and the file exists. Use absolute paths when possible.
Error: "Failed to parse document"
Cause: The document is corrupted, password-protected, or in an unsupported format.
Solution:
- Verify the document opens correctly in its native application
- Remove password protection if present
- Convert to a supported format if needed
Error: "Strategy not supported"
Cause: The selected PDF strategy is not available for the given document type.
Solution: Use the "Auto" strategy, or ensure the file is actually a PDF.
Integration with Other Nodes
Complete RAG Pipeline
Read Document → Chunk Text → Prepare for Embeddings → OpenAI Generate Batch Embeddings → Function (merge vectors) → LanceDB Add Data
Table Extraction Pipeline
Read Document → Function (filter Table elements) → Extract Tables → Process Data
Document Validation
Get Document Info → Switch (check isSupported) → Read Document → Process Content