Extract Text
Simple and fast text extraction from documents. Use this when you need plain text content without structural information. For detailed document structure analysis, use the Read Document node instead.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- File Path - Path to the document file. Supports PDF, DOCX, PPTX, TXT, MD, HTML, EML, RTF, EPUB, ODT, and more.
Options
- Element Separator - Text to insert between document elements. Default is \n\n (double newline). You can use custom separators such as:
  - Double newline (default, good readability)
  - Single newline (compact)
  - Single space (for continuous text)
  - Custom text like | or ---
- Include Page Breaks - Add page markers (e.g., "--- Page 2 ---") between pages. Default is false. Useful for tracking content location.
Output
- Text - Extracted text content as a single string.
- Character Count - Total number of characters in the extracted text.
- Page Count - Number of pages in the document (if detectable).
How It Works
The Extract Text node provides a simplified text extraction workflow. When executed, the node:
- Validates the provided file path
- Automatically detects and parses the document format
- Extracts all text elements from the document
- Joins text with the specified separator
- Optionally adds page break markers
- Returns the complete text string with character and page counts
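The join and page-marker steps above can be pictured with a minimal sketch. This is not the node's actual implementation: the element shape `{ text, page }` and the function name `joinElements` are assumptions made for illustration. Only the marker format follows the "--- Page 2 ---" example given under Options.

```javascript
// Illustrative sketch only; the node's internals are not documented here.
// Assumed (hypothetical) element shape: { text: string, page: number }.
function joinElements(elements, separator = "\n\n", includePageBreaks = false) {
  const parts = [];
  let currentPage = null;

  for (const el of elements) {
    // Emit a marker whenever the page number changes (never before the first page).
    if (includePageBreaks && currentPage !== null && el.page !== currentPage) {
      parts.push(`--- Page ${el.page} ---`);
    }
    currentPage = el.page;
    parts.push(el.text);
  }

  const text = parts.join(separator);
  // Note: String.length counts UTF-16 code units, which may differ from
  // the node's own Character Count for some scripts and emoji.
  return { text, charCount: text.length };
}

const sampleElements = [
  { text: "Introduction", page: 1 },
  { text: "Welcome to our guide...", page: 1 },
  { text: "Chapter 1...", page: 2 },
];

const joined = joinElements(sampleElements, "\n\n", true);
console.log(JSON.stringify(joined.text));
// prints "Introduction\n\nWelcome to our guide...\n\n--- Page 2 ---\n\nChapter 1..."
```

Because the marker is joined with the same separator as any other element, a single-space separator would place it inline with the surrounding text.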
Requirements
- Valid file path to a supported document format
- Read access to the file
- Sufficient memory for large documents
Error Handling
The node will return specific errors in the following cases:
- Empty or missing file path
- File not found at the specified path
- Unsupported document format
- Corrupted or invalid document file
- Insufficient permissions to read the file
Usage Examples
Example 1: Basic Text Extraction
Extract text from a PDF for processing:
// Input
msg.filePath = "/documents/report.pdf"
// Extract Text output
msg.text = "Annual Report 2024\n\nExecutive Summary\n\nThis year has seen..."
msg.charCount = 15420
msg.pageCount = 12
Example 2: Extract with Page Markers
Enable page tracking in extracted text:
// With "Include Page Breaks" enabled
msg.text = "Introduction\n\nWelcome to our guide...\n\n--- Page 2 ---\n\nChapter 1..."
Use this to maintain page references when processing the text further.
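As a rough sketch of that further processing, the text can be split back into per-page pieces in a Function node. The helper name `splitByPageMarkers` is hypothetical. The sketch assumes the markers look exactly like "--- Page N ---" and that the first page has no marker of its own, as in the example above.

```javascript
// Assumes markers of the form "--- Page N ---" and an unmarked first page.
// Body text that happens to contain the same pattern would be split as well.
function splitByPageMarkers(text) {
  const markerPattern = /\s*--- Page (\d+) ---\s*/g;
  const pages = [];
  let pageNumber = 1;
  let lastIndex = 0;
  let match;

  while ((match = markerPattern.exec(text)) !== null) {
    // Everything before this marker belongs to the page seen so far.
    pages.push({ page: pageNumber, text: text.slice(lastIndex, match.index) });
    pageNumber = Number(match[1]);
    lastIndex = markerPattern.lastIndex;
  }
  pages.push({ page: pageNumber, text: text.slice(lastIndex) });
  return pages;
}

const markedText =
  "Introduction\n\nWelcome to our guide...\n\n--- Page 2 ---\n\nChapter 1...";
const pages = splitByPageMarkers(markedText);
console.log(pages.map((p) => p.page)); // prints [ 1, 2 ]
```

Each entry keeps its page number next to its text, so a later step can cite the page that a match came from.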
Example 3: Custom Separator for Data Processing
Use a single-space separator for compact text:
// Element Separator: " "
msg.text = "Title Paragraph 1 Paragraph 2 List item 1 List item 2"
Example 4: Process Multiple Documents
Loop through a folder of documents:
// In a Loop node over file paths
const files = ["/docs/file1.pdf", "/docs/file2.docx", "/docs/file3.txt"];
// Extract Text processes each file
// Collect results in an array
if (!msg.allTexts) msg.allTexts = [];
msg.allTexts.push({
  filename: msg.filePath,
  text: msg.text,
  charCount: msg.charCount
});
Example 5: Quick Text Analysis
Combine with string operations:
// After Extract Text
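// Trim and drop empty tokens so empty or whitespace-padded text is not miscounted
const wordCount = msg.text.trim().split(/\s+/).filter(Boolean).length;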
const avgWordsPerPage = msg.pageCount > 0 ? wordCount / msg.pageCount : 0;
msg.stats = {
  characters: msg.charCount,
  words: wordCount,
  pages: msg.pageCount,
  avgWordsPerPage: Math.round(avgWordsPerPage)
};
// Output: { characters: 15420, words: 2850, pages: 12, avgWordsPerPage: 238 }
Tips for Effective Use
- When to Use Extract Text vs Read Document:
  - Use Extract Text when you only need the text content
  - Use Read Document when you need element types, structure, or selective extraction
  - Extract Text is generally faster for simple text extraction
- Separator Selection:
  - Use \n\n (default) for good readability and natural paragraph separation
  - Use single space for text that will be processed by NLP models
  - Use custom separators when you need to split the text later
- Page Break Markers:
  - Enable when you need to track which page content came from
  - Useful for citation and reference purposes
  - Keep disabled for cleaner text in most processing scenarios
- Memory Optimization:
  - For very large documents, consider the character count output to estimate memory usage
  - Process in batches if dealing with many large files
- For AI/RAG Workflows:
  - Extract Text output can be passed directly to Chunk Text node
  - Use double newline separator to help chunking algorithms identify paragraph boundaries
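To show why the double newline separator helps with chunking, here is a minimal paragraph-packing sketch. It is not the Chunk Text node's algorithm. The function name `chunkByParagraphs` and the `maxChars` limit are assumptions for illustration only.

```javascript
// Illustrative only: this is not how the Chunk Text node is implemented.
// Packs whole paragraphs into chunks of at most maxChars characters.
function chunkByParagraphs(text, maxChars) {
  const paragraphs = text
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter(Boolean);
  const chunks = [];
  let current = "";

  for (const paragraph of paragraphs) {
    const candidate = current ? current + "\n\n" + paragraph : paragraph;
    if (candidate.length <= maxChars || current === "") {
      // Either it fits, or a single oversized paragraph starts its own chunk.
      current = candidate;
    } else {
      chunks.push(current);
      current = paragraph;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const chunks = chunkByParagraphs(
  "Title\n\nFirst paragraph.\n\nSecond paragraph.",
  25
);
console.log(chunks.length); // prints 2
```

With a single-space separator the paragraph boundaries are gone, so a splitter like this would see one long paragraph and could only cut at arbitrary positions.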
Common Errors and Solutions
Error: "File path is required"
Cause: No file path was provided to the node.
Solution: Ensure the File Path input is connected or set with a valid path.
Error: "File does not exist"
Cause: The specified file path is invalid or the file doesn't exist.
Solution:
- Check the file path for typos
- Verify the file exists at that location
- Use absolute paths to avoid ambiguity
- Ensure the file hasn't been moved or deleted
Error: "Failed to extract text"
Cause: The document is corrupted, password-protected, or in an unsupported format.
Solution:
- Verify the file opens correctly in its native application
- Remove password protection if present
- Check that the file extension matches the actual file type
- Try using the Read Document node for more robust parsing
Warning: Page Count is 0
Cause: Page count is not available for the document type.
Solution: This is normal for some formats like TXT or HTML. Only certain formats (PDF, DOCX) reliably provide page counts.
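The messages listed in this section can be mapped to a short remediation hint inside a Function node, for example after a Catch node. This is a sketch under assumptions: how the error message reaches your flow is not specified here, and `suggestFix` and the substring matching are hypothetical choices.

```javascript
// Message fragments are the ones listed in this section.
// Matching by substring is an assumption about how errors are worded at runtime.
const ERROR_HINTS = [
  { match: "File path is required", hint: "Set or connect the File Path input." },
  { match: "File does not exist", hint: "Check the path for typos and prefer an absolute path." },
  {
    match: "Failed to extract text",
    hint: "Confirm the file opens natively, is not password-protected, and has the right extension.",
  },
];

function suggestFix(errorMessage) {
  const message = String(errorMessage || "");
  const found = ERROR_HINTS.find((entry) => message.includes(entry.match));
  return found ? found.hint : "Unrecognized error; inspect the full message.";
}

console.log(suggestFix("File does not exist"));
// prints "Check the path for typos and prefer an absolute path."
```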
Integration with Other Nodes
Simple Text Processing Pipeline
File System: Read File → Extract Text → Function (process text) → Save Results
Text to Chunks for Embeddings
Extract Text → Chunk Text (Basic strategy) → Prepare for Embeddings → OpenAI
Multi-Document Processing
Loop (files) → Extract Text → Append to Array → Save Combined Results
Document Search and Filter
Loop (documents) → Extract Text → Function (search for keywords) → Filter → Process Matches
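For the Document Search and Filter pipeline, the keyword-search Function step might look like the sketch below. The names `countKeywords` and `filterMatches` and the `{ filePath, text }` record shape are assumptions for illustration. Matching is plain case-insensitive substring search, not tokenized or stemmed search.

```javascript
// Counts case-insensitive, non-overlapping occurrences of each keyword.
function countKeywords(text, keywords) {
  const haystack = text.toLowerCase();
  const counts = {};

  for (const keyword of keywords) {
    const needle = keyword.toLowerCase();
    let count = 0;
    // An empty keyword would match everywhere, so treat it as zero hits.
    let index = needle ? haystack.indexOf(needle) : -1;
    while (index !== -1) {
      count += 1;
      index = haystack.indexOf(needle, index + needle.length);
    }
    counts[keyword] = count;
  }
  return counts;
}

// Keeps only documents in which at least one keyword occurs.
function filterMatches(documents, keywords) {
  return documents.filter((doc) =>
    Object.values(countKeywords(doc.text, keywords)).some((n) => n > 0)
  );
}

const sampleDocs = [
  { filePath: "/docs/file1.pdf", text: "Annual Report 2024. Revenue grew; revenue targets met." },
  { filePath: "/docs/file2.docx", text: "Meeting notes" },
];

const matched = filterMatches(sampleDocs, ["revenue", "budget"]);
console.log(matched.map((d) => d.filePath)); // prints [ '/docs/file1.pdf' ]
```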
Performance Considerations
- Speed: Extract Text is optimized for speed and is generally faster than Read Document
- Memory: Text output is stored as a string; very large documents may use significant memory
- File Size: Documents over 100MB may take considerable time to process
- Format Impact: PDF processing is slower than plain text formats
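When many large files are involved, one way to keep memory use bounded is to group files into batches under a size budget before looping over them. The sketch below assumes a hypothetical `{ path, sizeBytes }` record per file, obtained elsewhere in the flow. `batchBySize` is not part of the node.

```javascript
// Groups files so that each batch stays within maxBatchBytes where possible.
// A single file larger than the budget still gets a batch of its own.
function batchBySize(files, maxBatchBytes) {
  const batches = [];
  let current = [];
  let currentBytes = 0;

  for (const file of files) {
    if (current.length > 0 && currentBytes + file.sizeBytes > maxBatchBytes) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(file);
    currentBytes += file.sizeBytes;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

const MB = 1024 * 1024;
const fileList = [
  { path: "/docs/a.pdf", sizeBytes: 40 * MB },
  { path: "/docs/b.pdf", sizeBytes: 70 * MB },
  { path: "/docs/c.docx", sizeBytes: 20 * MB },
  { path: "/docs/d.pdf", sizeBytes: 120 * MB },
];

const batches = batchBySize(fileList, 100 * MB);
console.log(batches.map((b) => b.length)); // prints [ 1, 2, 1 ]
```

File size on disk is only a rough proxy for extracted text size, so treat the budget as a tuning knob, not a guarantee.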
Comparison with Read Document
| Feature | Extract Text | Read Document |
|---|---|---|
| Speed | Faster | Slower (more processing) |
| Output | Plain text only | Structured elements array |
| Element Types | No | Yes (Title, NarrativeText, etc.) |
| Metadata | No | Yes (page numbers, coordinates) |
| Use Case | Simple text extraction | Structure-aware processing |
| Memory | Lower | Higher |
| Best For | Text analysis, basic RAG | Semantic chunking, selective extraction |