Process Document

Performs OCR on documents and exports to various formats (PDF, DOCX, TXT, etc.) using ABBYY FineReader Engine.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Time to wait, in seconds, before executing the node.
  • Delay After (sec) - Time to wait, in seconds, after executing the node.
  • Continue On Error - If true, the automation continues even when this node raises an error. The default value is false.

Inputs

  • Path - Path to the input document or image file to process.
  • Out Path - Path where the processed document will be exported.

Options

  • Language - Language for OCR text recognition (default: English). Supports 200+ languages including all major European languages, Chinese, Japanese, Korean, Arabic, and many more.
  • Profile - Processing profile that defines quality and speed tradeoffs (default: DocumentConversion_Accuracy). Options:
    • DocumentConversion_Accuracy - Highest quality, best for editable documents
    • DocumentConversion_Speed - Faster processing with good quality
    • DocumentArchiving_Accuracy - Optimized for searchable archives
    • DocumentArchiving_Speed - Fast archival processing
    • TextExtraction_Accuracy - Best for text extraction only
    • TextExtraction_Speed - Fast text extraction
  • Text Type - Type of text in the document (default: Normal). Options:
    • Normal - Regular printed text
    • Typewriter - Typewriter or monospaced fonts
    • Matrix - Dot matrix printer text
    • Handprinted - Hand-printed text
    • Gothic - Gothic or blackletter fonts
    • OCR-A - OCR-A font
  • Export Format - Output file format (default: Searchable PDF). Options:
    • Text Document - Plain TXT file
    • Rich Text Format - RTF with formatting
    • MS Word Document - DOCX format
    • MS Excel Document - XLSX spreadsheet
    • MS PowerPoint Document - PPTX presentation
    • Searchable PDF - PDF with searchable text layer
    • PDF with Text and Images - Standard PDF format
    • PDF/A - Archival PDF format
    • XML - Structured XML output
  • Correct Skew - Whether to automatically correct skewed text lines (default: Auto). Options:
    • False - No correction
    • True - Always correct
    • Auto - Automatically decide
  • Correct Orientation - Whether to automatically detect and correct page orientation (default: false).
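
The option values above form a small closed vocabulary, so a flow that builds node settings programmatically can check them before the node runs. The following is a minimal Python sketch under that assumption. The class name ProcessDocumentOptions and its problems() method are hypothetical and are not part of the node or of ABBYY FineReader Engine; only the option names, values, and defaults come from the lists above.

```python
from dataclasses import dataclass

# Allowed values copied from the option lists above.
PROFILES = {
    "DocumentConversion_Accuracy", "DocumentConversion_Speed",
    "DocumentArchiving_Accuracy", "DocumentArchiving_Speed",
    "TextExtraction_Accuracy", "TextExtraction_Speed",
}
TEXT_TYPES = {"Normal", "Typewriter", "Matrix", "Handprinted", "Gothic", "OCR-A"}
EXPORT_FORMATS = {
    "Text Document", "Rich Text Format", "MS Word Document",
    "MS Excel Document", "MS PowerPoint Document", "Searchable PDF",
    "PDF with Text and Images", "PDF/A", "XML",
}
SKEW_MODES = {"False", "True", "Auto"}


@dataclass(frozen=True)
class ProcessDocumentOptions:
    """Hypothetical container; the defaults mirror the documented defaults."""
    language: str = "English"
    profile: str = "DocumentConversion_Accuracy"
    text_type: str = "Normal"
    export_format: str = "Searchable PDF"
    correct_skew: str = "Auto"
    correct_orientation: bool = False

    def problems(self) -> list:
        """Return one message per option whose value is not in its documented list."""
        checks = [
            ("Profile", self.profile, PROFILES),
            ("Text Type", self.text_type, TEXT_TYPES),
            ("Export Format", self.export_format, EXPORT_FORMATS),
            ("Correct Skew", self.correct_skew, SKEW_MODES),
        ]
        return [
            f"unknown {name}: {value!r}"
            for name, value, allowed in checks
            if value not in allowed
        ]
```

Language is deliberately left unchecked here because the full list of supported languages is not reproduced on this page.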

Outputs

This node produces an output file at the specified Out Path location.

How It Works

The Process Document node performs comprehensive OCR on documents. When executed, the node:

  1. Validates the input file exists
  2. Loads the predefined processing profile
  3. Creates an ABBYY FRDocument and adds the input image
  4. Preprocesses the image (orientation and skew correction)
  5. Recognizes text using the specified language and text type
  6. Exports the result in the selected format
  7. Saves the output to the specified path

Requirements

  • Valid ABBYY FineReader Engine installation
  • Valid ABBYY license
  • Input file must exist and be readable
  • Supported input formats: JPG, PNG, BMP, TIFF, PDF
  • Output directory must exist and be writable
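
The file-related requirements above can be checked before the node runs. The sketch below is one possible pre-flight check in Python, not the node's own validation logic. The function name preflight is hypothetical, and the extension set contains only the formats listed above; variants such as .jpeg or .tif are an assumption you would need to confirm for your installation before adding them.

```python
import os
from pathlib import Path

# Extensions taken from the supported input formats listed above.
# Variants like ".jpeg" or ".tif" are NOT listed in the requirements,
# so they are left out here; add them only if your setup accepts them.
SUPPORTED_INPUT_EXTENSIONS = {".jpg", ".png", ".bmp", ".tiff", ".pdf"}


def preflight(path: str, out_path: str) -> list:
    """Return a list of problems; an empty list means the file checks passed."""
    problems = []
    src = Path(path)

    # Input file must exist and be readable.
    if not src.is_file():
        problems.append(f"input file not found: {path}")
    elif not os.access(src, os.R_OK):
        problems.append(f"input file is not readable: {path}")

    # Input must be one of the supported formats.
    if src.suffix.lower() not in SUPPORTED_INPUT_EXTENSIONS:
        problems.append(f"unsupported input format: {src.suffix or '(none)'}")

    # Output directory must exist and be writable.
    out_dir = Path(out_path).parent
    if not out_dir.is_dir():
        problems.append(f"output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"output directory is not writable: {out_dir}")

    return problems
```

This check covers only the file system requirements. The ABBYY installation and license can only be verified by the engine itself.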

Error Handling

The node will return specific errors in the following cases:

  • ErrNotFound - Input document file not found. Error message includes the file path to help locate the issue.

Usage Example

Scenario: Convert a scanned invoice to searchable PDF

Process Document node:
- Path: "C:/invoices/invoice_2024_001.jpg"
- Out Path: "C:/processed/invoice_2024_001.pdf"
- Language: English
- Profile: DocumentConversion_Accuracy
- Text Type: Normal
- Export Format: Searchable PDF
- Correct Skew: Auto
- Correct Orientation: true

Scenario: Extract text from typewriter document

Process Document node:
- Path: "C:/old_docs/contract_1985.tiff"
- Out Path: "C:/text/contract_1985.txt"
- Language: English
- Profile: TextExtraction_Accuracy
- Text Type: Typewriter
- Export Format: Text Document
- Correct Skew: True
- Correct Orientation: false

Scenario: Create editable Word document from scan

Process Document node:
- Path: "C:/scans/report.pdf"
- Out Path: "C:/editable/report.docx"
- Language: English
- Profile: DocumentConversion_Accuracy
- Text Type: Normal
- Export Format: MS Word Document
- Correct Skew: Auto
- Correct Orientation: true
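
In each scenario above, the Out Path reuses the input file name and swaps the extension to match the Export Format. When many files are processed, that mapping can be generated rather than typed by hand. The following Python sketch assumes a hypothetical helper build_job and a hypothetical EXTENSION_BY_FORMAT table; the format names come from the Export Format options, while the extensions for RTF, PDF, and XML are the conventional ones rather than values stated on this page.

```python
from pathlib import PureWindowsPath

# Export Format name -> file extension (assumed conventional extensions).
EXTENSION_BY_FORMAT = {
    "Text Document": ".txt",
    "Rich Text Format": ".rtf",
    "MS Word Document": ".docx",
    "MS Excel Document": ".xlsx",
    "MS PowerPoint Document": ".pptx",
    "Searchable PDF": ".pdf",
    "PDF with Text and Images": ".pdf",
    "PDF/A": ".pdf",
    "XML": ".xml",
}


def build_job(path: str, out_dir: str, export_format: str, **options) -> dict:
    """Build one set of node settings, deriving Out Path from Path."""
    extension = EXTENSION_BY_FORMAT[export_format]  # KeyError on unknown format
    out = PureWindowsPath(out_dir) / (PureWindowsPath(path).stem + extension)
    return {
        "Path": path,
        # as_posix() keeps the forward-slash style used in the scenarios.
        "Out Path": out.as_posix(),
        "Export Format": export_format,
        **options,
    }


job = build_job(
    "C:/invoices/invoice_2024_001.jpg",
    "C:/processed",
    "Searchable PDF",
    Language="English",
    Profile="DocumentConversion_Accuracy",
)
print(job["Out Path"])  # C:/processed/invoice_2024_001.pdf
```

PureWindowsPath is used so the path handling matches the Windows-style paths in the scenarios regardless of the operating system running the script.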

Common Use Cases

  • Document Digitization - Convert paper documents to searchable digital formats
  • Archive Creation - Build searchable document archives in PDF/A format
  • Text Extraction - Extract plain text from scanned documents
  • Format Conversion - Convert between different document formats
  • OCR for Editing - Create editable documents (DOCX) from scans
  • Data Extraction - Extract text for further processing
  • Legal Documents - Process contracts and legal papers with high accuracy
  • Historical Documents - Run OCR on old typewritten or printed documents

Tips and Best Practices

  • Profile Selection:
    • Use "Accuracy" profiles for important documents
    • Use "Speed" profiles for large batches with time constraints
    • Use "DocumentConversion" profiles for editable output
    • Use "TextExtraction" profiles when only the text is needed
    • Use "DocumentArchiving" profiles for long-term storage
  • Language Selection:
    • Always specify the correct document language
    • Multiple languages in one document are supported
    • Selecting the wrong language severely reduces accuracy
    • Use language codes for non-English documents
  • Text Type:
    • Match text type to document characteristics
    • "Normal" works for most modern documents
    • Use "Typewriter" for old typed documents
    • Use "Matrix" for dot matrix printouts
    • Use "Handprinted" for hand-filled forms
  • Export Format:
    • Use PDF for distribution and archival
    • Use DOCX for editing and reformatting
    • Use TXT for simple text extraction
    • Use XLSX for tabular data
    • Use XML for programmatic processing
  • Image Preprocessing:
    • Enable "Correct Orientation" for batches with mixed page orientations
    • Use "Auto" (the default) for skew correction
    • Enable both when the condition of the documents is unknown
    • Corrections add minimal processing time
  • Input Quality:
    • Use 300 DPI or higher for scanning
    • Ensure good lighting and contrast
    • Clean documents before scanning
    • Avoid shadows and glare
  • Performance:
    • Processing time depends on document size and complexity
    • Accuracy profiles take longer than speed profiles
    • Multi-page documents process sequentially
    • Consider batch processing overnight for large volumes
  • Output Path:
    • Ensure the output directory exists
    • Use the file extension that matches the export format
    • Overwriting existing files is supported
    • Check disk space for large documents
  • Error Handling:
    • Enable Continue On Error for batch processing
    • Validate input files before processing
    • Check that the output file was created successfully
    • Log errors for troubleshooting
  • Quality Validation:
    • Spot-check OCR results for accuracy
    • Compare with original for critical documents
    • Review uncertain characters
    • Consider manual review for important documents
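
The error-handling tips above (check that the output file was created, log errors for troubleshooting) matter most when Continue On Error is enabled, because a failed file no longer stops the batch. The Python sketch below shows one way to verify outputs after a run. The function names output_ok and check_outputs and the logger name are hypothetical, and treating an empty file as a failure is an assumption, not documented node behavior.

```python
import logging
from pathlib import Path

log = logging.getLogger("process_document_batch")  # hypothetical logger name


def output_ok(out_path: str) -> bool:
    """True if the output file exists and is not empty."""
    p = Path(out_path)
    return p.is_file() and p.stat().st_size > 0


def check_outputs(out_paths):
    """Split expected outputs into (created, missing) and log the missing ones."""
    created, missing = [], []
    for out_path in out_paths:
        if output_ok(out_path):
            created.append(out_path)
        else:
            missing.append(out_path)
            log.error("expected output missing or empty: %s", out_path)
    return created, missing
```

The files in the missing list are candidates for a retry or for manual review.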