Extract Text

Simple and fast text extraction from documents. Use this when you need plain text content without structural information. For detailed document structure analysis, use the Read Document node instead.

Common Properties

Name - The custom name of the node.
Color - The custom color of the node.
Delay Before (sec) - Waits in seconds before executing the node.
Delay After (sec) - Waits in seconds after executing the node.
Continue On Error - Automation will continue regardless of any error. The default value is false.

Info: If the Continue On Error property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

File Path - Path to the document file. Supports PDF, DOCX, PPTX, TXT, MD, HTML, EML, RTF, EPUB, ODT, and more.

Options

Element Separator - Text to insert between document elements. Default is \n\n (double newline). You can use custom separators like:
- \n\n - Double newline (default, good readability)
- \n - Single newline (compact)
- " " - Single space (for continuous text)
- Custom text like | or ---
Include Page Breaks - Add page markers (e.g., "--- Page 2 ---") between pages. Default is false. Useful for tracking content location.

Output

Text - Extracted text content as a single string.
Character Count - Total number of characters in the extracted text.
Page Count - Number of pages in the document (if detectable).

How It Works

The Extract Text node provides a simplified text extraction workflow. When executed, the node:

1. Validates the provided file path
2. Automatically detects and parses the document format
3. Extracts all text elements from the document
4. Joins text with the specified separator
5. Optionally adds page break markers
6. Returns the complete text string with character and page counts
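
The join and page-marker steps above can be pictured with a minimal sketch. It assumes the parsed document is already available as an array of pages, each an array of element strings. The `joinElements` helper and that input shape are hypothetical and are not the node's actual implementation. The marker format follows the "--- Page 2 ---" example given under Options.

```javascript
// Hypothetical sketch: joins parsed elements the way the steps above describe.
// `pages` is assumed to be an array of pages, each an array of element strings.
function joinElements(pages, separator = "\n\n", includePageBreaks = false) {
  const parts = [];
  pages.forEach((elements, index) => {
    // Insert a marker before every page except the first (assumed format).
    if (includePageBreaks && index > 0) {
      parts.push(`--- Page ${index + 1} ---`);
    }
    for (const element of elements) {
      parts.push(element);
    }
  });
  const text = parts.join(separator);
  // text.length counts UTF-16 code units; the node's Character Count may differ.
  return { text, charCount: text.length, pageCount: pages.length };
}

const joined = joinElements(
  [["Introduction", "Welcome to our guide..."], ["Chapter 1..."]],
  "\n\n",
  true
);
// joined.text === "Introduction\n\nWelcome to our guide...\n\n--- Page 2 ---\n\nChapter 1..."
```

With the separator set to a single space and page breaks disabled, the same helper would produce one continuous line of text.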

Requirements

Valid file path to a supported document format
Read access to the file
Sufficient memory for large documents

Error Handling

The node will return specific errors in the following cases:

Empty or missing file path
File not found at the specified path
Unsupported document format
Corrupted or invalid document file
Insufficient permissions to read the file
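
Some of these cases can be caught before the node runs. The sketch below is one possible pre-check, for example in a Function node placed ahead of Extract Text. The `checkFilePath` helper is hypothetical, and the extension list only mirrors the formats named under Inputs. Because the node supports "and more", an unknown extension is better treated as a warning than as a hard failure.

```javascript
// Extensions named in the Inputs section; the node may accept others.
const KNOWN_EXTENSIONS = new Set([
  "pdf", "docx", "pptx", "txt", "md", "html", "eml", "rtf", "epub", "odt",
]);

// Hypothetical pre-check: returns { ok, reason } without touching the file system.
function checkFilePath(filePath) {
  if (typeof filePath !== "string" || filePath.trim() === "") {
    return { ok: false, reason: "File path is required" };
  }
  // Take the text after the last dot of the final path segment.
  const match = /\.([^./\\]+)$/.exec(filePath.trim());
  const ext = match ? match[1].toLowerCase() : "";
  if (!KNOWN_EXTENSIONS.has(ext)) {
    return { ok: false, reason: `Extension "${ext}" is not in the documented list` };
  }
  return { ok: true, reason: "" };
}

const pdfCheck = checkFilePath("/documents/report.pdf"); // ok: true
const emptyCheck = checkFilePath("");                    // ok: false
```

This check does not confirm that the file exists or is readable; those errors still surface from the node itself.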

Usage Examples

Example 1: Basic Text Extraction

Extract text from a PDF for processing:

// Input
msg.filePath = "/documents/report.pdf"

// Extract Text output
msg.text = "Annual Report 2024\n\nExecutive Summary\n\nThis year has seen..."
msg.charCount = 15420
msg.pageCount = 12

Example 2: Extract with Page Markers

Enable page tracking in extracted text:

// With "Include Page Breaks" enabled
msg.text = "Introduction\n\nWelcome to our guide...\n\n--- Page 2 ---\n\nChapter 1..."

Use this to maintain page references when processing the text further.
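
As a rough illustration of keeping page references, the text can be split back into per-page pieces. This sketch assumes the markers look exactly like "--- Page N ---" as in the example above, and that the marker text never appears in the document body. The `splitByPageMarkers` helper is hypothetical.

```javascript
// Hypothetical helper: splits extracted text into { page, text } objects
// using the "--- Page N ---" marker format shown in the example.
function splitByPageMarkers(text) {
  // The capture group keeps each page number in the split result:
  // [page1Text, "2", page2Text, "3", page3Text, ...]
  const pieces = text.split(/\s*--- Page (\d+) ---\s*/);
  const pages = [{ page: 1, text: pieces[0] }];
  for (let i = 1; i < pieces.length; i += 2) {
    pages.push({ page: Number(pieces[i]), text: pieces[i + 1] });
  }
  return pages;
}

const pageList = splitByPageMarkers(
  "Introduction\n\nWelcome to our guide...\n\n--- Page 2 ---\n\nChapter 1..."
);
// pageList[1] is { page: 2, text: "Chapter 1..." }
```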

Example 3: Custom Separator for Data Processing

Use space separator for compact text:

// Element Separator: " "
msg.text = "Title Paragraph 1 Paragraph 2 List item 1 List item 2"

Example 4: Process Multiple Documents

Loop through a folder of documents:

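// A Loop node iterates over these paths and sets msg.filePath on each pass
const files = ["/docs/file1.pdf", "/docs/file2.docx", "/docs/file3.txt"];

// After Extract Text has processed the current file,
// append its result to an array that persists across iterations
if (!Array.isArray(msg.allTexts)) msg.allTexts = [];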
msg.allTexts.push({
  filename: msg.filePath,
  text: msg.text,
  charCount: msg.charCount
});

Example 5: Quick Text Analysis

Combine with string operations:

// After Extract Text
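// filter(Boolean) drops the empty strings that leading or trailing whitespace would produce
const wordCount = msg.text.split(/\s+/).filter(Boolean).length;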
const avgWordsPerPage = msg.pageCount > 0 ? wordCount / msg.pageCount : 0;

msg.stats = {
  characters: msg.charCount,
  words: wordCount,
  pages: msg.pageCount,
  avgWordsPerPage: Math.round(avgWordsPerPage)
};

// Output: { characters: 15420, words: 2850, pages: 12, avgWordsPerPage: 238 }

Tips for Effective Use

When to Use Extract Text vs Read Document:
- Use Extract Text when you only need the text content
- Use Read Document when you need element types, structure, or selective extraction
- Extract Text is generally faster for simple text extraction

Separator Selection:
- Use \n\n (default) for good readability and natural paragraph separation
- Use a single space for text that will be processed by NLP models
- Use custom separators when you need to split the text later

Page Break Markers:
- Enable when you need to track which page content came from
- Useful for citation and reference purposes
- Keep disabled for cleaner text in most processing scenarios

Memory Optimization:
- For very large documents, use the Character Count output to estimate memory usage
- Process in batches if dealing with many large files

For AI/RAG Workflows:
- Extract Text output can be passed directly to the Chunk Text node
- Use the double newline separator to help chunking algorithms identify paragraph boundaries
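
To show why the double newline separator helps, here is a minimal sketch of paragraph-based chunking. It is not the Chunk Text node's algorithm. The `chunkByParagraphs` helper and the `maxChars` limit are assumptions made for illustration.

```javascript
// Hypothetical chunker: groups paragraphs (split on blank lines) into chunks
// of at most maxChars characters. A single paragraph longer than maxChars
// is kept whole rather than cut mid-sentence.
function chunkByParagraphs(text, maxChars = 1000) {
  const paragraphs = text
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter(Boolean);
  const chunks = [];
  let current = "";
  for (const paragraph of paragraphs) {
    const candidate = current ? current + "\n\n" + paragraph : paragraph;
    if (candidate.length > maxChars && current) {
      // Adding this paragraph would overflow, so close the current chunk.
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const chunks = chunkByParagraphs("aaaa\n\nbbbb\n\ncccc", 10);
// chunks is ["aaaa\n\nbbbb", "cccc"]
```

With a single space as the separator, the paragraph boundaries are lost and a chunker like this one would see the whole document as one paragraph.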

Common Errors and Solutions

Error: "File path is required"

Cause: No file path was provided to the node. Solution: Ensure the File Path input is connected or set with a valid path.

Error: "File does not exist"

Cause: The specified file path is invalid or the file doesn't exist. Solution:

Check the file path for typos
Verify the file exists at that location
Use absolute paths to avoid ambiguity
Ensure the file hasn't been moved or deleted

Error: "Failed to extract text"

Cause: The document is corrupted, password-protected, or in an unsupported format. Solution:

Verify the file opens correctly in its native application
Remove password protection if present
Check that the file extension matches the actual file type
Try using Read Document node for more robust parsing

Warning: Page Count is 0

Cause: Page count is not available for the document type. Solution: This is normal for some formats like TXT or HTML. Only certain formats (PDF, DOCX) reliably provide page counts.

Integration with Other Nodes

Simple Text Processing Pipeline

File System: Read File → Extract Text → Function (process text) → Save Results

Text to Chunks for Embeddings

Extract Text → Chunk Text (Basic strategy) → Prepare for Embeddings → OpenAI

Multi-Document Processing

Loop (files) → Extract Text → Append to Array → Save Combined Results

Document Search and Filter

Loop (documents) → Extract Text → Function (search for keywords) → Filter → Process Matches
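
The keyword search step in this pipeline could look like the following sketch. It assumes results were collected as objects with `filename` and `text` fields, as in Example 4. The `findMatches` helper is hypothetical and uses simple case-insensitive substring matching.

```javascript
// Hypothetical search step: returns the documents that contain at least
// one keyword, together with the keywords that were found in each.
function findMatches(documents, keywords) {
  const needles = keywords.map((k) => k.toLowerCase());
  return documents
    .map((doc) => {
      const haystack = doc.text.toLowerCase();
      const matched = needles.filter((n) => haystack.includes(n));
      return { filename: doc.filename, matched };
    })
    .filter((result) => result.matched.length > 0);
}

const matches = findMatches(
  [
    { filename: "/docs/file1.pdf", text: "Annual Report 2024" },
    { filename: "/docs/file3.txt", text: "Meeting notes" },
  ],
  ["report", "budget"]
);
// matches is [{ filename: "/docs/file1.pdf", matched: ["report"] }]
```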

Performance Considerations

Speed: Extract Text is optimized for speed and is generally faster than Read Document
Memory: Text output is stored as a string; very large documents may use significant memory
File Size: Documents over 100 MB may take considerable time to process
Format Impact: PDF processing is slower than plain text formats

Comparison with Read Document

Feature	Extract Text	Read Document
Speed	Faster	Slower (more processing)
Output	Plain text only	Structured elements array
Element Types	No	Yes (Title, NarrativeText, etc.)
Metadata	No	Yes (page numbers, coordinates)
Use Case	Simple text extraction	Structure-aware processing
Memory	Lower	Higher
Best For	Text analysis, basic RAG	Semantic chunking, selective extraction
