Process Document
Performs OCR on documents and exports to various formats (PDF, DOCX, TXT, etc.) using ABBYY FineReader Engine.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Time to wait, in seconds, before executing the node.
- Delay After (sec) - Time to wait, in seconds, after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
Inputs
- Path - Path to the input document or image file to process.
- Out Path - Path where the processed document will be exported.
Options
- Language - Language for OCR text recognition (default: English). Supports 200+ languages including all major European languages, Chinese, Japanese, Korean, Arabic, and many more.
- Profile - Processing profile that defines the quality and speed tradeoff (default: DocumentConversion_Accuracy). Options:
  - DocumentConversion_Accuracy - Highest quality, best for editable documents
  - DocumentConversion_Speed - Faster processing with good quality
  - DocumentArchiving_Accuracy - Optimized for searchable archives
  - DocumentArchiving_Speed - Fast archival processing
  - TextExtraction_Accuracy - Best for text extraction only
  - TextExtraction_Speed - Fast text extraction
- Text Type - Type of text in the document (default: Normal). Options:
  - Normal - Regular printed text
  - Typewriter - Typewriter or monospaced fonts
  - Matrix - Dot matrix printer text
  - Handprinted - Hand-printed text
  - Gothic - Gothic or blackletter fonts
  - OCR-A - OCR-A font
- Export Format - Output file format (default: Searchable PDF). Options:
  - Text Document - Plain TXT file
  - Rich Text Format - RTF with formatting
  - MS Word Document - DOCX format
  - MS Excel Document - XLSX spreadsheet
  - MS PowerPoint Document - PPTX presentation
  - Searchable PDF - PDF with searchable text layer
  - PDF with Text and Images - Standard PDF format
  - PDF/A - Archival PDF format
  - XML - Structured XML output
- Correct Skew - Whether to automatically correct skewed text lines (default: Auto). Options:
  - False - No correction
  - True - Always correct
  - Auto - Automatically decide
- Correct Orientation - Whether to automatically detect and correct page orientation (default: false).
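The options and their documented defaults can be captured in a small Python structure, for instance when node settings are generated from a script. This is only a sketch: the class name `ProcessDocumentOptions` and its field names are hypothetical and not part of the node's interface. Only the option values and defaults are taken from the lists above.

```python
from dataclasses import dataclass

# Allowed values copied from the option lists above.
PROFILES = (
    "DocumentConversion_Accuracy", "DocumentConversion_Speed",
    "DocumentArchiving_Accuracy", "DocumentArchiving_Speed",
    "TextExtraction_Accuracy", "TextExtraction_Speed",
)
TEXT_TYPES = ("Normal", "Typewriter", "Matrix", "Handprinted", "Gothic", "OCR-A")
EXPORT_FORMATS = (
    "Text Document", "Rich Text Format", "MS Word Document",
    "MS Excel Document", "MS PowerPoint Document", "Searchable PDF",
    "PDF with Text and Images", "PDF/A", "XML",
)
SKEW_MODES = ("False", "True", "Auto")


@dataclass(frozen=True)
class ProcessDocumentOptions:
    """Hypothetical container mirroring the node's options and defaults."""
    language: str = "English"
    profile: str = "DocumentConversion_Accuracy"
    text_type: str = "Normal"
    export_format: str = "Searchable PDF"
    correct_skew: str = "Auto"
    correct_orientation: bool = False

    def __post_init__(self):
        # Reject values that do not appear in the documented option lists.
        checks = (
            ("profile", self.profile, PROFILES),
            ("text_type", self.text_type, TEXT_TYPES),
            ("export_format", self.export_format, EXPORT_FORMATS),
            ("correct_skew", self.correct_skew, SKEW_MODES),
        )
        for name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{name}: {value!r} is not one of {allowed}")
```

Validating at construction time surfaces a mistyped profile or format name before any document is processed.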
Outputs
This node produces an output file at the specified Out Path location.
How It Works
The Process Document node performs comprehensive OCR on documents. When executed, the node:
- Validates the input file exists
- Loads the predefined processing profile
- Creates an ABBYY FRDocument and adds the input image
- Preprocesses the image (orientation and skew correction)
- Recognizes text using the specified language and text type
- Exports the result in the selected format
- Saves the output to the specified path
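The sequence above can be sketched as a plain Python function. The sketch does not call ABBYY FineReader Engine: the `engine` object and every method on it (`load_profile`, `create_document`, `add_image`, `preprocess`, `recognize`, `export`) are hypothetical stand-ins, and `RecordingEngine` exists only so the example runs without a license. Only the order of the steps is taken from the description.

```python
import os
import tempfile


class RecordingEngine:
    """Hypothetical stand-in for the OCR engine; it only records calls."""

    def __init__(self):
        self.calls = []

    def load_profile(self, profile):
        self.calls.append("load_profile")

    def create_document(self):
        self.calls.append("create_document")
        return self  # the stand-in doubles as the document object

    def add_image(self, path):
        self.calls.append("add_image")

    def preprocess(self, correct_skew, correct_orientation):
        self.calls.append("preprocess")

    def recognize(self, language, text_type):
        self.calls.append("recognize")

    def export(self, out_path, export_format):
        self.calls.append("export")
        # A real engine would write the converted document; this writes a stub.
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write("placeholder output\n")


def process_document(engine, path, out_path, language="English",
                     profile="DocumentConversion_Accuracy", text_type="Normal",
                     export_format="Searchable PDF", correct_skew="Auto",
                     correct_orientation=False):
    # 1. Validate that the input file exists (the node reports ErrNotFound).
    if not os.path.isfile(path):
        raise FileNotFoundError(f"input document not found: {path}")
    # 2. Load the predefined processing profile.
    engine.load_profile(profile)
    # 3. Create a document and add the input image.
    doc = engine.create_document()
    doc.add_image(path)
    # 4. Preprocess the image (orientation and skew correction).
    doc.preprocess(correct_skew, correct_orientation)
    # 5. Recognize text with the chosen language and text type.
    doc.recognize(language, text_type)
    # 6-7. Export in the selected format and save to the output path.
    doc.export(out_path, export_format)
    return out_path


def demo():
    """Run the sketch against a throwaway file and return the recorded calls."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "scan.png")
        with open(src, "wb") as fh:
            fh.write(b"not a real image")
        engine = RecordingEngine()
        process_document(engine, src, os.path.join(tmp, "scan.pdf"))
        return engine.calls


if __name__ == "__main__":
    print(demo())
```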
Requirements
- Valid ABBYY FineReader Engine installation
- Valid ABBYY license
- Input file must exist and be readable
- Supported input formats: JPG, PNG, BMP, TIFF, PDF
- Output directory must exist and be writable
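The file-related requirements can be checked before the node runs. The helper below is a minimal sketch, and `preflight_problems` is a hypothetical name. It covers only the input file, its extension, and the output directory, and it cannot verify the ABBYY installation or license. The extension set is taken literally from the list above, so variants such as `.jpeg` or `.tif` are not assumed to be accepted.

```python
import os

# Extensions taken literally from the supported input formats listed above.
SUPPORTED_INPUT_EXTENSIONS = {".jpg", ".png", ".bmp", ".tiff", ".pdf"}


def preflight_problems(path, out_path):
    """Return a list of problems; an empty list means the file checks passed."""
    problems = []

    # The input file must exist and be readable.
    if not os.path.isfile(path):
        problems.append(f"input file not found: {path}")
    elif not os.access(path, os.R_OK):
        problems.append(f"input file is not readable: {path}")

    # The input format must be one of the supported formats.
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_INPUT_EXTENSIONS:
        problems.append(f"unsupported input format: {ext or '(none)'}")

    # The output directory must exist and be writable.
    out_dir = os.path.dirname(os.path.abspath(out_path))
    if not os.path.isdir(out_dir):
        problems.append(f"output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"output directory is not writable: {out_dir}")

    return problems
```

Returning every problem at once, instead of stopping at the first, makes it easier to fix a batch of jobs in one pass.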
Error Handling
The node will return specific errors in the following cases:
- ErrNotFound - Input document file not found. The error message includes the file path to help locate the issue.
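As an illustration of an error that carries both a code and the offending path, here is a minimal Python sketch. `NodeError` and `require_input` are hypothetical names, and the message wording is an assumption. Only the `ErrNotFound` code and the fact that the message includes the file path come from the description above.

```python
import os


class NodeError(Exception):
    """Hypothetical error type carrying a code such as "ErrNotFound"."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def require_input(path):
    """Return path unchanged, or raise ErrNotFound with the path in the message."""
    if not os.path.isfile(path):
        # Assumed wording; the real node's message text may differ.
        raise NodeError("ErrNotFound", f"input document not found: {path}")
    return path
```

Downstream steps can then branch on `err.code` instead of parsing the message text.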
Usage Example
Scenario: Convert a scanned invoice to searchable PDF
Process Document node:
- Path: "C:/invoices/invoice_2024_001.jpg"
- Out Path: "C:/processed/invoice_2024_001.pdf"
- Language: English
- Profile: DocumentConversion_Accuracy
- Text Type: Normal
- Export Format: Searchable PDF
- Correct Skew: Auto
- Correct Orientation: true
Scenario: Extract text from a typewritten document
Process Document node:
- Path: "C:/old_docs/contract_1985.tiff"
- Out Path: "C:/text/contract_1985.txt"
- Language: English
- Profile: TextExtraction_Accuracy
- Text Type: Typewriter
- Export Format: Text Document
- Correct Skew: True
- Correct Orientation: false
Scenario: Create an editable Word document from a scan
Process Document node:
- Path: "C:/scans/report.pdf"
- Out Path: "C:/editable/report.docx"
- Language: English
- Profile: DocumentConversion_Accuracy
- Text Type: Normal
- Export Format: MS Word Document
- Correct Skew: Auto
- Correct Orientation: true
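The scenarios above share one pattern: the Out Path reuses the input file name with an extension that matches the Export Format. The Python sketch below builds such settings as a dictionary. `build_node_config` and `EXTENSION_FOR_FORMAT` are hypothetical helpers, not part of the node. The dictionary keys simply mirror the labels used in the scenarios, and the extensions follow the formats named in the Export Format list.

```python
import posixpath

# File extension for each export format, following the Export Format list.
EXTENSION_FOR_FORMAT = {
    "Text Document": ".txt",
    "Rich Text Format": ".rtf",
    "MS Word Document": ".docx",
    "MS Excel Document": ".xlsx",
    "MS PowerPoint Document": ".pptx",
    "Searchable PDF": ".pdf",
    "PDF with Text and Images": ".pdf",
    "PDF/A": ".pdf",
    "XML": ".xml",
}


def build_node_config(path, out_dir, export_format="Searchable PDF", overrides=None):
    """Return node settings with an Out Path derived from the input name."""
    # posixpath is used because the scenarios write paths with forward slashes.
    stem = posixpath.splitext(posixpath.basename(path))[0]
    out_path = posixpath.join(out_dir, stem + EXTENSION_FOR_FORMAT[export_format])
    config = {
        "Path": path,
        "Out Path": out_path,
        "Language": "English",
        "Profile": "DocumentConversion_Accuracy",
        "Text Type": "Normal",
        "Export Format": export_format,
        "Correct Skew": "Auto",
        "Correct Orientation": False,
    }
    # Apply scenario-specific settings on top of the documented defaults.
    config.update(overrides or {})
    return config


typewriter_job = build_node_config(
    "C:/old_docs/contract_1985.tiff",
    "C:/text",
    export_format="Text Document",
    overrides={"Profile": "TextExtraction_Accuracy",
               "Text Type": "Typewriter",
               "Correct Skew": "True"},
)
```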
Common Use Cases
- Document Digitization - Convert paper documents to searchable digital formats
- Archive Creation - Build searchable document archives in PDF/A format
- Text Extraction - Extract plain text from scanned documents
- Format Conversion - Convert between different document formats
- OCR for Editing - Create editable documents (DOCX) from scans
- Data Extraction - Extract text for further processing
- Legal Documents - Process contracts and legal papers with high accuracy
- Historical Documents - Run OCR on old typewritten or printed documents
Tips and Best Practices
- Profile Selection:
  - Use "Accuracy" profiles for important documents
  - Use "Speed" profiles for large batches with time constraints
  - Use "DocumentConversion" profiles for editable output
  - Use "TextExtraction" profiles when only text is needed
  - Use "DocumentArchiving" profiles for long-term storage
- Language Selection:
  - Always specify the correct document language
  - Multiple languages in one document are supported
  - A wrong language setting severely reduces accuracy
  - Use language codes for non-English documents
- Text Type:
  - Match the text type to the document's characteristics
  - Use "Normal" for most modern documents
  - Use "Typewriter" for old typed documents
  - Use "Matrix" for dot matrix printouts
  - Use "Handprinted" for hand-filled forms
- Export Format:
  - Use PDF for distribution and archival
  - Use DOCX for editing and reformatting
  - Use TXT for simple text extraction
  - Use XLSX for tabular data
  - Use XML for programmatic processing
- Image Preprocessing:
  - Enable "Correct Orientation" when pages have mixed orientations
  - Use "Auto" for skew correction; it is the default and suits most documents
  - Enable both when the condition of the documents is unknown
  - These corrections add little processing time
- Input Quality:
  - Scan at 300 DPI or higher
  - Ensure good lighting and contrast
  - Clean documents before scanning
  - Avoid shadows and glare
- Performance:
  - Processing time depends on document size and complexity
  - Accuracy profiles take longer than speed profiles
  - Multi-page documents are processed sequentially
  - Consider running large batches overnight
- Output Path:
  - Ensure the output directory exists
  - Use the file extension that matches the export format
  - Overwriting existing files is supported
  - Check available disk space for large documents
- Error Handling:
  - Enable Continue On Error for batch processing
  - Validate input files before processing
  - Check that the output file was created
  - Log errors for troubleshooting
- Quality Validation:
  - Spot-check OCR results for accuracy
  - Compare the output with the original for critical documents
  - Review uncertain characters
  - Consider manual review for important documents
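Several of the error-handling tips (continuing after a failure, checking that the output was created, logging errors) can be combined in a small batch loop. The Python sketch below is hypothetical: `run_batch` is not part of the node, and `process` stands for any callable that performs the conversion for one file. The `continue_on_error` flag only imitates the behavior described for the Continue On Error property.

```python
import logging
import os

log = logging.getLogger("process_document_batch")


def run_batch(jobs, process, continue_on_error=True):
    """Run process(path, out_path) for each job.

    jobs is an iterable of (path, out_path) pairs. Returns a tuple
    (succeeded, failed), where failed holds (path, error message) pairs.
    """
    succeeded, failed = [], []
    for path, out_path in jobs:
        try:
            process(path, out_path)
            # Check that the output file was actually created and is not empty.
            if not os.path.isfile(out_path) or os.path.getsize(out_path) == 0:
                raise RuntimeError(f"no output produced at {out_path}")
            succeeded.append(out_path)
        except Exception as exc:
            # Log the failure so it can be investigated later.
            log.error("failed to process %s: %s", path, exc)
            failed.append((path, str(exc)))
            if not continue_on_error:
                raise
    return succeeded, failed
```

Collecting failures in a list keeps one bad scan from stopping an overnight run, while still leaving a record for manual review.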