Classify Document

Classifies documents using a trained ABBYY classification model.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Time to wait, in seconds, before executing the node.
  • Delay After (sec) - Time to wait, in seconds, after executing the node.
  • Continue On Error - If true, the automation continues even when this node fails. The default value is false.

Inputs

  • Model Path - Path to the trained classification model file (an .mdc file created by the Train Model node).
  • File Path - Path to the document file to classify.

Options

  • Correct Orientation - Whether to automatically detect and correct document orientation before classification (default: false).

Outputs

  • Label - Classification result label assigned to the document based on the trained model.

How It Works

The Classify Document node categorizes documents using machine learning. When executed, the node:

  1. Loads the trained classification model from file
  2. Determines whether text recognition is needed (based on model type)
  3. Creates an ABBYY FRDocument and loads the input file
  4. Optionally preprocesses to correct orientation
  5. Analyzes and recognizes text if required by the model
  6. Classifies the document using the loaded model
  7. Returns the category label with the highest confidence

Requirements

  • Valid ABBYY FineReader Engine installation with Classification Engine
  • Valid ABBYY license
  • Trained classification model file (.mdc)
  • Input document file must exist
  • Model must be trained on similar document types

Error Handling

The node returns an error if:

  • The model file is not found or is invalid
  • The input document file is not found
  • The classification engine fails to initialize
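Two of the error conditions above (a missing model file and a missing input document) can be caught before the node runs. The following is a minimal sketch of such a pre-flight check in Python; the function name `preflight_errors` is hypothetical and not part of the node. It only inspects the file system, so it cannot detect a corrupt model or an engine initialization failure; those are still reported by the node itself.

```python
from pathlib import Path


def preflight_errors(model_path: str, file_path: str) -> list[str]:
    """Collect problems that can be detected before the node runs.

    Hypothetical helper: it checks only that both files exist and that
    the model file has the .mdc extension described under Inputs.
    """
    errors = []

    model = Path(model_path)
    if not model.is_file():
        errors.append(f"Model file not found: {model_path}")
    elif model.suffix.lower() != ".mdc":
        errors.append(f"Model file does not have the .mdc extension: {model_path}")

    if not Path(file_path).is_file():
        errors.append(f"Input document file not found: {file_path}")

    return errors
```

An empty list means neither file-related problem was found; otherwise each entry describes one problem, which can be logged or used to skip the document.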

Usage Example

Scenario: Classify incoming invoices by vendor

1. Train Model (one-time setup):
- Create model for vendors: VendorA, VendorB, VendorC

2. Classify Document node:
- Model Path: "C:/models/vendor_classifier.mdc"
- File Path: "C:/inbox/invoice_001.pdf"
- Correct Orientation: true
- Output: {{ $.label }} (e.g., "VendorA")

3. Route document based on label

Scenario: Classify documents by type

Classify Document node:
- Model Path: "C:/models/document_types.mdc"
- File Path: "C:/documents/unknown_doc.pdf"
- Correct Orientation: true
- Output: {{ $.label }} (e.g., "Invoice", "Contract", "Receipt")

Scenario: Automated document sorting

Loop through documents:
1. Classify Document:
- Model Path: "C:/models/sorter.mdc"
- File Path: {{ $.current_doc }}

2. Decision based on {{ $.label }}:
- "Invoice" → Move to invoices folder
- "Contract" → Move to contracts folder
- "Receipt" → Move to receipts folder
- "Not classified" → Move to review folder
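
The decision step above amounts to a lookup from label to destination folder, with a fallback for anything unrecognized. Below is a minimal Python sketch of that logic. The folder names, the `C:/sorted` base directory, and the function name `destination_for` are assumptions made for illustration; only the label names come from the scenario.

```python
from pathlib import Path

# Hypothetical folder layout; the label names come from the scenario above.
DESTINATIONS = {
    "Invoice": "invoices",
    "Contract": "contracts",
    "Receipt": "receipts",
}
REVIEW_FOLDER = "review"


def destination_for(label: str, base_dir: str = "C:/sorted") -> Path:
    """Return the target folder for a classification label.

    "Not classified" and any label missing from DESTINATIONS both fall
    back to the review folder, so no document is silently dropped.
    """
    folder = DESTINATIONS.get(label.strip(), REVIEW_FOLDER)
    return Path(base_dir) / folder
```

Sending unknown labels to the review folder, rather than raising an error, keeps a batch running if the model is later retrained with categories the routing table does not yet know.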

Common Use Cases

  • Document Routing - Automatically route documents to appropriate departments
  • Invoice Processing - Classify invoices by vendor or type
  • Email Attachment Sorting - Classify and organize email attachments
  • Archive Organization - Categorize documents for archival
  • Quality Control - Identify document types for validation
  • Compliance Checking - Classify documents by regulatory category
  • Automated Workflows - Trigger different workflows based on document type
  • Content Management - Auto-tag documents in content management systems

Classification Types

Image-Based Classification

  • Uses visual features of the document
  • Fast classification without OCR
  • Good for documents with consistent layouts
  • Examples: Forms, templates, letterheads

Text-Based Classification

  • Uses OCR to extract and analyze text content
  • More accurate for text-heavy documents
  • Slower due to OCR requirement
  • Examples: Contracts, reports, letters

Combined Classification

  • Uses both image and text features
  • Best accuracy for mixed documents
  • Balances speed and accuracy
  • Recommended for general use

Classification Results

The output label is one of the following:

  • Category Name - A label from the training data. When several categories match, the one with the highest match score is returned.
  • "Not classified" - The document doesn't match any trained category well.

Tips and Best Practices

  • Model Selection:
    • Use model trained on similar document types
    • Retrain model if classification accuracy is poor
    • Models are domain-specific (don't reuse a model across unrelated document domains)
    • Keep models updated with new document variations
  • Orientation Correction:
    • Enable if documents may be rotated
    • Adds minimal processing time
    • Critical for scanned document batches
    • Not needed for digital documents
  • Model Training:
    • Train with representative samples
    • Include variations of each category
    • More training samples generally improve accuracy
    • Test model before production use
  • Classification Accuracy:
    • Depends on model quality
    • Depends on how similar the document is to the training data
    • Check confidence scores if available
    • Implement manual review for low confidence
  • Performance:
    • Image-based: Very fast (< 1 second)
    • Text-based: Slower (OCR required, 3-5 seconds)
    • Combined: Medium speed (2-4 seconds)
    • Processing time depends on model type
  • Error Handling:
    • Enable Continue On Error for batch processing
    • Handle "Not classified" results
    • Log classification results for monitoring
    • Flag low-confidence classifications for review
  • Workflow Integration:
    • Classify before detailed processing
    • Use label to route to appropriate workflow
    • Combine with other nodes based on type
    • Track classification accuracy over time
  • Batch Processing:
    • Classify multiple documents in a loop
    • Track success rates by category
    • Identify documents needing manual review
    • Update model based on misclassifications
  • Quality Assurance:
    • Spot-check classifications regularly
    • Monitor "Not classified" frequency
    • Review misclassified documents
    • Retrain model when accuracy drops
    • Keep training data updated
  • Common Issues:
    • Poor classification → Retrain model with more samples
    • "Not classified" → Document type is not in the training data
    • Wrong category → Add similar documents to training
    • Slow performance → Consider image-based model
  • Best Practices:
    • Start with small category set (3-5 types)
    • Expand categories gradually
    • Test thoroughly before production
    • Monitor and log all classifications
    • Maintain model version control
    • Document category definitions clearly
    • Implement feedback loop for improvements
  • Document Preparation:
    • Consistent image quality helps
    • Standard resolution (300 DPI)
    • Clean scans without artifacts
    • Correct orientation before classification
  • Model Management:
    • Store models in version control
    • Document model training parameters
    • Track model performance metrics
    • Plan for model updates
    • Test new models before deployment
  • Advanced Usage:
    • Combine multiple classifiers (hierarchical)
    • Use confidence thresholds for routing
    • Implement multi-label classification
    • Track classification statistics
    • Build classification dashboards
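
Several of the tips above (monitoring "Not classified" frequency, tracking statistics by category) come down to counting the labels a batch produced. Here is a minimal Python sketch, assuming the labels have already been collected into a list; the function name `summarize` and the shape of the returned dictionary are illustrative choices, not something the node provides.

```python
from collections import Counter

NOT_CLASSIFIED = "Not classified"


def summarize(labels: list[str]) -> dict:
    """Summarize a batch of classification labels.

    Returns the total count, a per-label count, and the share of
    documents that came back as "Not classified".
    """
    counts = Counter(labels)
    total = len(labels)
    unclassified = counts.get(NOT_CLASSIFIED, 0)
    # Guard against division by zero for an empty batch.
    rate = unclassified / total if total else 0.0
    return {
        "total": total,
        "by_label": dict(counts),
        "not_classified_rate": rate,
    }
```

A rising `not_classified_rate` across batches is one signal that new document variations have appeared and the model may need retraining.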