Extract Text

Simple and fast text extraction from documents. Use this when you need plain text content without structural information. For detailed document structure analysis, use the Read Document node instead.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • File Path - Path to the document file. Supports PDF, DOCX, PPTX, TXT, MD, HTML, EML, RTF, EPUB, ODT, and more.

Options

  • Element Separator - Text to insert between document elements. Default is \n\n (double newline). You can use custom separators like:
    • \n\n - Double newline (default, good readability)
    • \n - Single newline (compact)
    • " " - Single space (for continuous text)
    • Custom text like | or ---
  • Include Page Breaks - Add page markers (e.g., "--- Page 2 ---") between pages. Default is false. Useful for tracking content location.
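
A custom separator is handy when the extracted text has to be broken back into its elements later. The sketch below assumes the node ran with Element Separator set to | and that this character does not occur in the document text itself. splitElements is a hypothetical helper, not part of the node, and the sample string is made up for illustration.

```javascript
// Hypothetical helper: recover individual elements from Extract Text output
// that was produced with a custom Element Separator (assumed here: "|").
function splitElements(text, separator) {
  if (!separator) throw new Error("separator must be a non-empty string");
  return text
    .split(separator)                    // literal string split, no regex involved
    .map((part) => part.trim())          // remove stray whitespace around elements
    .filter((part) => part.length > 0);  // drop empty fragments
}

// Sample value standing in for msg.text (made up for illustration)
const sampleText = "Title|Paragraph 1|Paragraph 2|List item 1";
const elements = splitElements(sampleText, "|");
console.log(elements.length); // 4
```

If the separator can plausibly appear in the source documents, a longer and more unusual string is the safer choice.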

Output

  • Text - Extracted text content as a single string.
  • Character Count - Total number of characters in the extracted text.
  • Page Count - Number of pages in the document (if detectable).

How It Works

The Extract Text node provides a simplified text extraction workflow. When executed, the node:

  1. Validates the provided file path
  2. Automatically detects and parses the document format
  3. Extracts all text elements from the document
  4. Joins text with the specified separator
  5. Optionally adds page break markers
  6. Returns the complete text string with character and page counts
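
Steps 3 to 6 can be pictured with the sketch below. It is not the node's actual implementation: the element shape ({ text, page }), the function name joinElements, and the rule for placing page markers are assumptions. Only the marker format and the output names follow the examples on this page.

```javascript
// Illustrative only: mimics joining parsed elements into one string.
// Assumed element shape: { text: string, page: number }, pages starting at 1.
function joinElements(elements, separator = "\n\n", includePageBreaks = false) {
  const parts = [];
  let currentPage = 1;
  let maxPage = 0;
  for (const el of elements) {
    maxPage = Math.max(maxPage, el.page);
    // Emit a marker whenever the page number advances
    if (includePageBreaks && el.page > currentPage) {
      parts.push(`--- Page ${el.page} ---`);
    }
    currentPage = Math.max(currentPage, el.page);
    parts.push(el.text);
  }
  const text = parts.join(separator);
  // text.length counts UTF-16 code units; the node may count characters differently
  return { text, charCount: text.length, pageCount: maxPage };
}

const result = joinElements(
  [
    { text: "Introduction", page: 1 },
    { text: "Welcome to our guide...", page: 1 },
    { text: "Chapter 1...", page: 2 },
  ],
  "\n\n",
  true
);
console.log(result.pageCount); // 2
```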

Requirements

  • Valid file path to a supported document format
  • Read access to the file
  • Sufficient memory for large documents

Error Handling

The node will return specific errors in the following cases:

  • Empty or missing file path
  • File not found at the specified path
  • Unsupported document format
  • Corrupted or invalid document file
  • Insufficient permissions to read the file
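
Several of these cases can be caught before the node runs. The following is a minimal sketch of such a pre-check, assuming a Node.js environment where the fs and path modules are available. precheckFilePath, KNOWN_EXTENSIONS, and the "not readable" and "not in the known list" messages are hypothetical; the extension list is only the subset named in the File Path description.

```javascript
const fs = require("fs");
const path = require("path");

// Assumed subset of supported extensions, taken from the File Path description;
// the node supports more formats than are listed here.
const KNOWN_EXTENSIONS = new Set([
  ".pdf", ".docx", ".pptx", ".txt", ".md", ".html", ".eml", ".rtf", ".epub", ".odt",
]);

// Hypothetical pre-check: returns null when the path looks usable,
// otherwise a short reason string.
function precheckFilePath(filePath) {
  if (typeof filePath !== "string" || filePath.trim() === "") {
    return "File path is required";
  }
  if (!fs.existsSync(filePath)) {
    return "File does not exist";
  }
  try {
    fs.accessSync(filePath, fs.constants.R_OK); // throws if the file is not readable
  } catch (err) {
    return "File is not readable";
  }
  const ext = path.extname(filePath).toLowerCase();
  if (!KNOWN_EXTENSIONS.has(ext)) {
    return `Extension "${ext}" is not in the known list`;
  }
  return null;
}

console.log(precheckFilePath("")); // "File path is required"
```

A pre-check like this cannot detect corrupted or password-protected files; those still surface as errors from the node itself.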

Usage Examples

Example 1: Basic Text Extraction

Extract text from a PDF for processing:

// Input
msg.filePath = "/documents/report.pdf"

// Extract Text output
msg.text = "Annual Report 2024\n\nExecutive Summary\n\nThis year has seen..."
msg.charCount = 15420
msg.pageCount = 12

Example 2: Extract with Page Markers

Enable page tracking in extracted text:

// With "Include Page Breaks" enabled
msg.text = "Introduction\n\nWelcome to our guide...\n\n--- Page 2 ---\n\nChapter 1..."

Use this to maintain page references when processing the text further.
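
One way to use the markers downstream is to split the text back into per-page pieces. This sketch assumes the markers look exactly like "--- Page N ---", that the first page carries no marker, and that the same pattern does not occur in the document body. splitByPageMarkers is a hypothetical helper.

```javascript
// Hypothetical helper: split Extract Text output into per-page chunks,
// assuming markers of the form "--- Page N ---" between pages.
function splitByPageMarkers(text) {
  // A capture group in split() keeps the page numbers in the result:
  // [page1Text, "2", page2Text, "3", page3Text, ...]
  const parts = text.split(/\s*--- Page (\d+) ---\s*/);
  const pages = [{ page: 1, text: parts[0] }];
  for (let i = 1; i < parts.length; i += 2) {
    pages.push({ page: Number(parts[i]), text: parts[i + 1] });
  }
  return pages;
}

const pages = splitByPageMarkers(
  "Introduction\n\nWelcome to our guide...\n\n--- Page 2 ---\n\nChapter 1..."
);
console.log(pages.length); // 2
```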

Example 3: Custom Separator for Data Processing

Use a single-space separator to produce compact, continuous text:

// Element Separator: " "
msg.text = "Title Paragraph 1 Paragraph 2 List item 1 List item 2"

Example 4: Process Multiple Documents

Loop through a folder of documents:

// File paths fed to the Loop node (one path per iteration)
const files = ["/docs/file1.pdf", "/docs/file2.docx", "/docs/file3.txt"];

// Extract Text processes each file in turn.
// After each run, collect the result in an array that persists on msg.
if (!msg.allTexts) msg.allTexts = [];
msg.allTexts.push({
  filename: msg.filePath,
  text: msg.text,
  charCount: msg.charCount
});

Example 5: Quick Text Analysis

Combine with string operations:

// After Extract Text
// Trim first so leading or trailing whitespace does not produce empty tokens
const trimmed = msg.text.trim();
const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
const avgWordsPerPage = msg.pageCount > 0 ? wordCount / msg.pageCount : 0;

msg.stats = {
  characters: msg.charCount,
  words: wordCount,
  pages: msg.pageCount,
  avgWordsPerPage: Math.round(avgWordsPerPage)
};

// Output: { characters: 15420, words: 2850, pages: 12, avgWordsPerPage: 238 }

Tips for Effective Use

  1. When to Use Extract Text vs Read Document:

    • Use Extract Text when you only need the text content
    • Use Read Document when you need element types, structure, or selective extraction
    • Extract Text is generally faster for simple text extraction
  2. Separator Selection:

    • Use \n\n (default) for good readability and natural paragraph separation
    • Use a single space for text that will be processed by NLP models
    • Use custom separators when you need to split the text later
  3. Page Break Markers:

    • Enable when you need to track which page content came from
    • Useful for citation and reference purposes
    • Keep disabled for cleaner text in most processing scenarios
  4. Memory Optimization:

    • For very large documents, use the Character Count output to estimate memory usage
    • Process in batches if dealing with many large files
  5. For AI/RAG Workflows:

    • Extract Text output can be passed directly to the Chunk Text node
    • Use double newline separator to help chunking algorithms identify paragraph boundaries
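
The memory tips above can be sketched as follows. Both helpers are hypothetical. The estimate assumes the text is held as a JavaScript string at up to 2 bytes per UTF-16 code unit and treats Character Count as a code-unit count, so it is only a rough upper bound, not a measurement.

```javascript
// Hypothetical helper: split a list of file paths into fixed-size batches.
function toBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Rough upper bound: JavaScript strings use up to 2 bytes per UTF-16 code unit.
function estimateStringBytes(charCount) {
  return charCount * 2;
}

const batches = toBatches(["/docs/file1.pdf", "/docs/file2.docx", "/docs/file3.txt"], 2);
console.log(batches.length);             // 2
console.log(estimateStringBytes(15420)); // 30840
```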

Common Errors and Solutions

Error: "File path is required"

Cause: No file path was provided to the node.

Solution: Ensure the File Path input is connected or set to a valid path.

Error: "File does not exist"

Cause: The specified file path is invalid or the file doesn't exist.

Solution:

  • Check the file path for typos
  • Verify the file exists at that location
  • Use absolute paths to avoid ambiguity
  • Ensure the file hasn't been moved or deleted

Error: "Failed to extract text"

Cause: The document is corrupted, password-protected, or in an unsupported format.

Solution:

  • Verify the file opens correctly in its native application
  • Remove password protection if present
  • Check that the file extension matches the actual file type
  • Try using the Read Document node for more robust parsing

Warning: Page Count is 0

Cause: Page count is not available for the document type.

Solution: This is normal for some formats, such as TXT or HTML. Only certain formats (PDF, DOCX) reliably provide page counts.

Integration with Other Nodes

Simple Text Processing Pipeline

File System: Read File → Extract Text → Function (process text) → Save Results

Text to Chunks for Embeddings

Extract Text → Chunk Text (Basic strategy) → Prepare for Embeddings → OpenAI

Multi-Document Processing

Loop (files) → Extract Text → Append to Array → Save Combined Results

Document Search and Filter

Loop (documents) → Extract Text → Function (search for keywords) → Filter → Process Matches
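
The keyword-search step in this pipeline could look like the sketch below. It assumes a plain, case-insensitive substring match is sufficient. findKeywords, the isMatch field, and the sample object are hypothetical names used for illustration only.

```javascript
// Hypothetical Function-node logic: case-insensitive keyword search
// over the text produced by Extract Text.
function findKeywords(text, keywords) {
  const haystack = text.toLowerCase();
  // Keep only the keywords that occur somewhere in the text
  return keywords.filter((kw) => haystack.includes(kw.toLowerCase()));
}

// Sample object standing in for msg after Extract Text (values made up)
const sample = {
  filePath: "/docs/file1.pdf",
  text: "Annual Report 2024\n\nExecutive Summary",
};
const matched = findKeywords(sample.text, ["summary", "invoice"]);
sample.isMatch = matched.length > 0; // a later Filter step could test this flag
console.log(matched); // [ 'summary' ]
```

Substring matching also hits partial words (for example "port" inside "Report"); a word-boundary regular expression would be needed to avoid that.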

Performance Considerations

  • Speed: Extract Text is optimized for speed and is generally faster than Read Document
  • Memory: Text output is stored as a string; very large documents may use significant memory
  • File Size: Documents over 100 MB may take considerable time to process
  • Format Impact: PDF processing is slower than plain text formats

Comparison with Read Document

| Feature | Extract Text | Read Document |
| --- | --- | --- |
| Speed | Faster | Slower (more processing) |
| Output | Plain text only | Structured elements array |
| Element Types | No | Yes (Title, NarrativeText, etc.) |
| Metadata | No | Yes (page numbers, coordinates) |
| Use Case | Simple text extraction | Structure-aware processing |
| Memory | Lower | Higher |
| Best For | Text analysis, basic RAG | Semantic chunking, selective extraction |