Split Document

Splits multi-page documents into individual page image files using ABBYY FineReader Engine.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Time to wait, in seconds, before executing the node.
  • Delay After (sec) - Time to wait, in seconds, after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.

Inputs

  • Path - Path to the multi-page document or image file to split.
  • Out Path - Base path for output files (page numbers will be appended automatically).

Options

This node has no configurable options.

Outputs

This node produces multiple output image files, one for each page in the document. Files are named sequentially.

How It Works

The Split Document node separates multi-page files into individual images. When executed, the node:

  1. Validates that the input file exists
  2. Creates an ABBYY FRDocument and loads all pages
  3. Iterates through each page in the document
  4. Extracts the image from each page
  5. Saves each page as a separate image file
  6. Appends page numbers to the base filename
  7. Closes handles and cleans up resources

Requirements

  • Valid ABBYY FineReader Engine installation
  • Valid ABBYY license
  • Input file must exist (a single-page file produces one output file)
  • Supported input formats: PDF and TIFF (including multi-page TIFF)
  • Output directory must exist and be writable
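Some of these requirements can be checked before the node runs. The following is a minimal Python sketch, not part of the node itself: the function name `preflight_problems` is hypothetical, and the extension set is taken literally from the output formats listed on this page (whether variants such as .jpeg or .tif are accepted is not stated here). It does not check the ABBYY installation or license.

```python
import os

# Output extensions listed on this page; other spellings are an open question.
SUPPORTED_OUTPUT_EXTENSIONS = {".jpg", ".png", ".bmp", ".tiff"}


def preflight_problems(path: str, out_path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # The node reports ErrNotFound when the input file is missing.
    if not os.path.isfile(path):
        problems.append(f"input file not found: {path}")

    # The output directory must already exist and be writable.
    out_dir = os.path.dirname(out_path) or "."
    if not os.path.isdir(out_dir):
        problems.append(f"output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        # Note: os.access is only a best-effort check on some platforms.
        problems.append(f"output directory is not writable: {out_dir}")

    # The output format is determined by the Out Path extension.
    extension = os.path.splitext(out_path)[1].lower()
    if extension not in SUPPORTED_OUTPUT_EXTENSIONS:
        problems.append(f"unexpected output extension: {extension or '(none)'}")

    return problems
```

An empty result only means these file-system checks passed; the node can still fail for other reasons.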

Error Handling

The node will return specific errors in the following cases:

  • ErrNotFound - Input document file not found. Error message includes the path.

Usage Example

Scenario: Split a multi-page PDF into individual JPG files

Split Document node:
- Path: "C:/documents/contract.pdf"
- Out Path: "C:/pages/contract_page.jpg"

Output files:
- C:/pages/contract_page.jpg (page 1)
- C:/pages/contract_page_2.jpg (page 2)
- C:/pages/contract_page_3.jpg (page 3)
- ...

Scenario: Split a scanned multi-page TIFF

Split Document node:
- Path: "C:/scans/report.tiff"
- Out Path: "C:/individual/page.png"

Output files:
- C:/individual/page.png (page 1)
- C:/individual/page_2.png (page 2)
- C:/individual/page_3.png (page 3)
- ...

Scenario: Process each page separately after splitting

1. Split Document node:
- Path: "C:/docs/book.pdf"
- Out Path: "C:/pages/page.jpg"

2. Get list of created files

3. Loop through pages:
- Process each page individually
- Apply different processing based on page number
- Extract specific data from each page
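Step 2 of this scenario, getting the list of created files, could be sketched in Python as follows. This is an illustration under the naming rules described on this page (first page uses the base name, later pages get `_N` appended); the helper name `list_split_pages` is hypothetical and is not something the node provides.

```python
import os
import re


def list_split_pages(out_path: str) -> list[tuple[int, str]]:
    """Return (page_number, file_path) pairs for an Out Path, sorted by page."""
    directory, filename = os.path.split(out_path)
    stem, extension = os.path.splitext(filename)

    # Later pages are expected to look like "<stem>_<number><extension>".
    numbered = re.compile(re.escape(stem) + r"_(\d+)" + re.escape(extension))

    pages = []
    for name in os.listdir(directory or "."):
        if name == filename:
            # The first page uses the base filename as-is.
            pages.append((1, os.path.join(directory, name)))
            continue
        match = numbered.fullmatch(name)
        if match:
            pages.append((int(match.group(1)), os.path.join(directory, name)))

    # Sort numerically so that page 10 comes after page 2.
    return sorted(pages)
```

Looping over the result gives each page number together with its file, which makes it straightforward to apply different processing based on page number.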

Common Use Cases

  • Page-by-Page Processing - Process each page with different settings
  • Selective Page Extraction - Extract and process only certain pages
  • Parallel Processing - Process multiple pages in parallel
  • Page Classification - Classify or route pages individually
  • Multi-Document Splitting - Separate combined scans into individual documents
  • PDF Conversion - Convert PDF pages to image format
  • Archive Preparation - Prepare pages for different archive systems
  • Batch Processing - Process document collections page-by-page

Output File Naming

First Page

  • Uses the base filename as-is
  • Example: contract_page.jpg

Subsequent Pages

  • Appends page number with underscore
  • Example: contract_page_2.jpg, contract_page_3.jpg
  • Numbering starts at 2 for the second page

File Extension

  • Determined by Out Path extension
  • Supports: JPG, PNG, BMP, TIFF
  • Choose the extension that matches the image format you need
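The naming rules above can be expressed as a small Python function. This is a sketch of the documented behavior for use in surrounding automation code, not the node's actual implementation; the name `page_output_path` is hypothetical.

```python
import os


def page_output_path(out_path: str, page_number: int) -> str:
    """Return the expected output file path for a 1-based page number."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")

    # The first page uses the base filename as-is.
    if page_number == 1:
        return out_path

    # Later pages get "_<page number>" inserted before the extension.
    base, extension = os.path.splitext(out_path)
    return f"{base}_{page_number}{extension}"


if __name__ == "__main__":
    for number in (1, 2, 3):
        print(page_output_path("C:/pages/contract_page.jpg", number))
```

For the Out Path "C:/pages/contract_page.jpg", this yields the same names as the usage example: contract_page.jpg, contract_page_2.jpg, contract_page_3.jpg.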

Tips and Best Practices

  • Input Formats:
    • Works with multi-page PDF files
    • Supports multi-page TIFF images
    • Single-page documents produce one output file
    • Verify input format is supported
  • Output Format:
    • Use JPG for photographs and general use
    • Use PNG for screenshots and graphics
    • Use TIFF for archival quality
    • Match extension in Out Path
  • File Naming:
    • Choose descriptive base names
    • Include document identifier
    • Consider sequential numbering in name
    • Avoid special characters in paths
  • Disk Space:
    • Each page becomes a separate file
    • Image files can be large (especially PNG/TIFF)
    • Estimate: ~500KB - 5MB per page
    • Ensure sufficient disk space
  • Performance:
    • Splitting is relatively fast
    • Time depends on page count and size
    • No OCR performed (images only)
    • Sequential processing (one page at a time)
  • Memory Usage:
    • Processes one page at a time
    • Memory efficient for large documents
    • Handles are closed and released after processing
    • Suitable for hundreds of pages
  • Workflow Integration:
    • Split first, then process pages individually
    • Enables parallel processing of pages
    • Allows selective page processing
    • Combine results after individual processing
  • Page Count:
    • No limit on page count
    • Works with 1 to 1000+ pages
    • Consider processing time for large documents
    • Monitor output directory size
  • Quality:
    • Output preserves original image quality
    • No lossy compression or quality loss (with PNG/TIFF)
    • JPG uses default compression
    • Original resolution maintained
  • Use Cases by Industry:
    • Legal: Split contracts and agreements
    • Healthcare: Separate medical record pages
    • Finance: Split statements and reports
    • HR: Separate application packages
    • Education: Split scanned textbooks
  • After Splitting:
    • Track which pages came from which document
    • Store original document reference
    • Consider page reassembly strategy
    • Implement cleanup for temporary files
  • Error Handling:
    • Enable Continue On Error for batch jobs
    • Verify all pages were created
    • Confirm the output file count matches the expected page count
    • Handle incomplete splits
  • Automation Patterns:
    • Split → Classify → Route pages
    • Split → Process individually → Combine results
    • Split → Extract data → Database insert
    • Split → OCR → Index → Archive
  • Best Practices:
    • Clean up split files after processing
    • Log page count and file sizes
    • Verify split completed successfully
    • Implement error recovery
    • Consider temporary directory for splits
  • Limitations:
    • Extracts images only (no OCR)
    • The split itself runs sequentially (pages are not split in parallel)
    • All pages use same output format
    • Page numbering is automatic (not customizable)
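The advice to verify that all pages were created and to log file sizes could be sketched in Python as below. It assumes the caller already knows the expected page count from some other source (the node's outputs described on this page do not include it), and the names `expected_page_paths` and `check_split` are hypothetical.

```python
import os


def expected_page_paths(out_path: str, page_count: int) -> list[str]:
    """Build the expected output paths using the documented naming rules."""
    base, extension = os.path.splitext(out_path)
    return [
        out_path if number == 1 else f"{base}_{number}{extension}"
        for number in range(1, page_count + 1)
    ]


def check_split(out_path: str, page_count: int) -> tuple[list[str], int]:
    """Return (missing_paths, total_bytes_on_disk) for an expected split."""
    paths = expected_page_paths(out_path, page_count)
    missing = [path for path in paths if not os.path.isfile(path)]
    total_bytes = sum(
        os.path.getsize(path) for path in paths if os.path.isfile(path)
    )
    return missing, total_bytes
```

A non-empty `missing` list indicates an incomplete split. As a rough point of comparison for `total_bytes`, the per-page estimate above (~500KB to 5MB) puts a 200-page document at about 100MB to 1GB.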