Read Document
Parse any document (PDF, DOCX, PPTX, TXT, HTML, MD, etc.) into structured elements with text and metadata. This node provides detailed document structure analysis, making it ideal for understanding document layout and extracting specific sections.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If Continue On Error is set to true, errors are not caught when the project is executed, even if a Catch node is used.
Inputs
- File Path - Path to the document file to parse. Supports PDF, DOCX, PPTX, TXT, MD, HTML, EML, RTF, EPUB, ODT, CSV, TSV, JSON, XML, and more.
Options
- PDF Strategy - Strategy for PDF parsing. Default is "Auto (Recommended)".
- Auto (Recommended) - Automatically selects the best strategy for the PDF
- Fast (Text-based PDFs) - Optimized for PDFs with selectable text (faster processing)
- Hi-Res (Complex layouts) - Best for documents with complex layouts, tables, or mixed content
- OCR Only (Scanned docs) - Uses optical character recognition for image-only PDFs or scanned documents
- Include Metadata - Include element metadata such as page numbers, coordinates, and languages. Default is true.
Output
- Elements - Array of parsed elements, each containing:
- text - The extracted text content
- type - Element type (Title, NarrativeText, ListItem, Table, Image, Header, Footer)
- metadata - (if enabled) Page number, filename, coordinates, languages
- Element Count - Number of elements extracted from the document.
- Full Text - All text concatenated as a single string for convenience.
Element Types
The Read Document node categorizes content into the following element types:
- Title - Document titles and section headings
- NarrativeText - Paragraphs and body text
- ListItem - Bulleted or numbered list items
- Table - Table elements (use Extract Tables node for structured table data)
- Image - Image elements
- Header - Page headers
- Footer - Page footers
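To see how a document breaks down across these types, a Function node placed after Read Document can tally elements by type. This is a sketch only; the sample `msg` below mirrors the output shape documented above, not real node output:

```javascript
// Count parsed elements by type (sample msg shape is assumed for illustration).
const msg = {
  elements: [
    { text: "Report Title", type: "Title", metadata: { page: 1 } },
    { text: "Body paragraph...", type: "NarrativeText", metadata: { page: 1 } },
    { text: "Item one", type: "ListItem", metadata: { page: 2 } },
    { text: "Item two", type: "ListItem", metadata: { page: 2 } },
  ],
};

const countsByType = {};
for (const el of msg.elements) {
  countsByType[el.type] = (countsByType[el.type] || 0) + 1;
}
console.log(countsByType); // { Title: 1, NarrativeText: 1, ListItem: 2 }
```

A tally like this is a quick sanity check: a scanned PDF parsed with the wrong strategy often shows up as a single giant NarrativeText element or no elements at all.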
How It Works
The Read Document node uses the Unstructured library to intelligently parse documents. When executed, the node:
- Validates the provided file path
- Automatically detects the document format based on file extension
- Applies the appropriate parsing strategy (especially for PDFs)
- Extracts all elements with their text content and type
- Captures metadata for each element (page numbers, coordinates, etc.)
- Returns the structured elements array and the concatenated full text
Requirements
- Valid file path to a supported document format
- Read access to the file
- Sufficient memory for large documents
Error Handling
The node will return specific errors in the following cases:
- Empty or missing file path
- File not found at the specified path
- Unsupported document format
- Corrupted or invalid document file
- Insufficient memory for very large documents
Usage Examples
Example 1: Extract Document Structure for RAG
Process a PDF and prepare it for semantic search:
// Read Document output
msg.elements = [
{ text: "Introduction to Machine Learning", type: "Title", metadata: { page: 1 } },
{ text: "Machine learning is a subset of artificial intelligence...", type: "NarrativeText", metadata: { page: 1 } },
{ text: "What is supervised learning?", type: "Title", metadata: { page: 2 } },
// ... more elements
]
msg.elementCount = 45
msg.text = "Introduction to Machine Learning\n\nMachine learning is..."
Connect to Chunk Text node to split into embedding-ready chunks while preserving semantic boundaries.
Example 2: Filter Specific Element Types
Use a Function node to extract only titles:
// After Read Document
const titles = msg.elements.filter(el => el.type === 'Title');
msg.titles = titles.map(el => el.text);
// Output: ["Introduction", "Chapter 1", "Chapter 2", ...]
Example 3: Process Multi-Page PDF with Page Tracking
// Read Document output includes page numbers
msg.elements.forEach(element => {
console.log(`Page ${element.metadata.page}: ${element.text.substring(0, 50)}...`);
});
Tips for Effective Use
- Choose the Right PDF Strategy:
- Use Auto for general documents - it works well in most cases
- Use Fast when you know the PDF has selectable text and no images
- Use Hi-Res for scanned documents, complex layouts, or documents with tables
- Use OCR Only specifically for image-based PDFs without embedded text
- Memory Considerations:
- Very large documents (100+ pages) may require significant memory
- Consider processing large documents in batches or pages if memory is limited
- For RAG Workflows:
- Connect directly to Chunk Text node with "By Title" strategy to preserve document structure
- Keep metadata enabled to track source pages in your vector database
- Element Type Filtering:
- Use a Function node after Read Document to filter specific element types
- Filter out Headers/Footers if they contain repetitive navigation text
- Text Extraction vs Structured Parsing:
- Use Read Document when you need document structure and element types
- Use Extract Text when you only need plain text content (faster)
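The element-filtering tip can be sketched as a Function node that drops page furniture before chunking. The `msg` shape follows the Output section above and is an assumption, not real node output:

```javascript
// Remove Header/Footer elements so repetitive page furniture
// doesn't pollute downstream chunks (sample msg assumed).
const msg = {
  elements: [
    { text: "ACME Corp - Confidential", type: "Header", metadata: { page: 1 } },
    { text: "Quarterly results improved...", type: "NarrativeText", metadata: { page: 1 } },
    { text: "Page 1 of 10", type: "Footer", metadata: { page: 1 } },
  ],
};

const SKIP = new Set(["Header", "Footer"]);
msg.elements = msg.elements.filter((el) => !SKIP.has(el.type));

// Rebuild the convenience full-text field from the surviving elements.
msg.text = msg.elements.map((el) => el.text).join("\n\n");
```

Filtering before Chunk Text keeps boilerplate like "Page 1 of 10" out of your embeddings, where it would otherwise match many unrelated queries.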
Common Errors and Solutions
Error: "File not found"
Cause: The specified file path doesn't exist or is incorrect.
Solution: Verify the file path is correct and the file exists. Use absolute paths when possible.
Error: "Failed to parse document"
Cause: The document is corrupted, password-protected, or in an unsupported format.
Solution:
- Verify the document opens correctly in its native application
- Remove password protection if present
- Convert to a supported format if needed
Error: "Strategy not supported"
Cause: The selected PDF strategy is not available for the given document type.
Solution: Use the "Auto" strategy, or ensure the file is actually a PDF.
Integration with Other Nodes
Complete RAG Pipeline
Read Document → Chunk Text → Prepare for Embeddings → OpenAI Generate Batch Embeddings → Function (merge vectors) → LanceDB Add Data
Table Extraction Pipeline
Read Document → Function (filter Table elements) → Extract Tables → Process Data
Document Validation
Get Document Info → Switch (check isSupported) → Read Document → Process Content