Classify Document
Classifies documents using a trained ABBYY classification model.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Time to wait, in seconds, before executing the node.
- Delay After (sec) - Time to wait, in seconds, after executing the node.
- Continue On Error - If enabled, the automation continues even if this node fails. The default value is false.
Inputs
- Model Path - Path to the trained classification model file (an .mdc file created by the Train Model node).
- File Path - Path to the document file to classify.
Options
- Correct Orientation - Whether to automatically detect and correct document orientation before classification (default: false).
Outputs
- Label - Classification result label assigned to the document based on the trained model.
How It Works
The Classify Document node categorizes documents using machine learning. When executed, the node:
- Loads the trained classification model from file
- Determines if text recognition is needed (based on model type)
- Creates an ABBYY FRDocument and loads the input file
- Optionally preprocesses to correct orientation
- Analyzes and recognizes text if required by the model
- Classifies the document using the loaded model
- Returns the category label with the highest confidence
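The order of these steps can be sketched as plain control flow. This is only an illustration of the sequence described above, not the ABBYY API: `ModelType`, `plan_steps`, and the step names are hypothetical, and the rule that text and combined models need recognition is an assumption based on the Classification Types section.

```python
from enum import Enum


class ModelType(Enum):
    """Hypothetical model kinds, mirroring the three classification types."""
    IMAGE = "image"
    TEXT = "text"
    COMBINED = "combined"


def plan_steps(model_type: ModelType, correct_orientation: bool = False) -> list[str]:
    """Return the ordered processing steps for one document."""
    # Assumption: text and combined models need recognized text; image models do not.
    needs_ocr = model_type in (ModelType.TEXT, ModelType.COMBINED)

    steps = ["load_model", "load_document"]
    if correct_orientation:
        # Optional preprocessing, controlled by the Correct Orientation option.
        steps.append("correct_orientation")
    if needs_ocr:
        steps.append("analyze_and_recognize")
    steps += ["classify", "return_label"]
    return steps
```

Under these assumptions, an image-based model with Correct Orientation disabled skips both optional steps, which is why it is the fastest configuration.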
Requirements
- Valid ABBYY FineReader Engine installation with Classification Engine
- Valid ABBYY license
- Trained classification model file (.mdc)
- Input document file must exist
- Model must be trained on similar document types
Error Handling
The node returns an error if:
- The model file is not found or is invalid
- The input document file is not found
- The classification engine fails to initialize
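The two file-related errors can be caught before the node runs. The following is a minimal sketch of such a pre-flight check, assuming it runs in a separate scripting step; `preflight_errors` is a hypothetical helper, and it cannot detect a corrupt model or an engine initialization failure.

```python
from pathlib import Path


def preflight_errors(model_path: str, file_path: str) -> list[str]:
    """Return a list of problems found before the node runs (empty if none)."""
    problems = []
    model = Path(model_path)
    document = Path(file_path)

    if not model.is_file():
        problems.append(f"Model file not found: {model}")
    elif model.suffix.lower() != ".mdc":
        # Only the extension is checked; the file contents are not validated.
        problems.append(f"Model file does not have the .mdc extension: {model}")

    if not document.is_file():
        problems.append(f"Input document not found: {document}")

    return problems
```

An empty list means both paths look usable; any other result can be logged or used to skip the document.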
Usage Example
Scenario: Classify incoming invoices by vendor
1. Train Model (one-time setup):
   - Create model for vendors: VendorA, VendorB, VendorC
2. Classify Document node:
   - Model Path: "C:/models/vendor_classifier.mdc"
   - File Path: "C:/inbox/invoice_001.pdf"
   - Correct Orientation: true
   - Output: {{ $.label }} (e.g., "VendorA")
3. Route document based on label
Scenario: Classify documents by type
Classify Document node:
- Model Path: "C:/models/document_types.mdc"
- File Path: "C:/documents/unknown_doc.pdf"
- Correct Orientation: true
- Output: {{ $.label }} (e.g., "Invoice", "Contract", "Receipt")
Scenario: Automated document sorting
Loop through documents:
1. Classify Document:
   - Model Path: "C:/models/sorter.mdc"
   - File Path: {{ $.current_doc }}
2. Decision based on {{ $.label }}:
   - "Invoice" → Move to invoices folder
   - "Contract" → Move to contracts folder
   - "Receipt" → Move to receipts folder
   - "Not classified" → Move to review folder
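If the move itself is done in a scripting step, the decision in the sorting scenario could look like the sketch below. The folder names, `FOLDERS`, `REVIEW_FOLDER`, and `sort_document` are hypothetical; the label is assumed to be the value the node returned in {{ $.label }}.

```python
import shutil
from pathlib import Path

# Hypothetical folder names; adjust to your own layout.
FOLDERS = {
    "Invoice": "invoices",
    "Contract": "contracts",
    "Receipt": "receipts",
}
REVIEW_FOLDER = "review"


def sort_document(doc_path: str, label: str, base_dir: str) -> Path:
    """Move the document into the folder for its label and return the new path."""
    # Any label without a mapping, including "Not classified", goes to review.
    target_dir = Path(base_dir) / FOLDERS.get(label, REVIEW_FOLDER)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(doc_path).name
    shutil.move(str(doc_path), str(target))
    return target
```

Sending every unmapped label to the review folder means a newly trained category is never silently dropped if the mapping has not been updated yet.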
Common Use Cases
- Document Routing - Automatically route documents to appropriate departments
- Invoice Processing - Classify invoices by vendor or type
- Email Attachment Sorting - Classify and organize email attachments
- Archive Organization - Categorize documents for archival
- Quality Control - Identify document types for validation
- Compliance Checking - Classify documents by regulatory category
- Automated Workflows - Trigger different workflows based on document type
- Content Management - Auto-tag documents in content management systems
Classification Types
Image-Based Classification
- Uses visual features of the document
- Fast classification without OCR
- Good for documents with consistent layouts
- Examples: Forms, templates, letterheads
Text-Based Classification
- Uses OCR to extract and analyze text content
- More accurate for text-heavy documents
- Slower due to OCR requirement
- Examples: Contracts, reports, letters
Combined Classification
- Uses both image and text features
- Best accuracy for mixed documents
- Balances speed and accuracy
- Recommended for general use
Classification Results
The output label represents:
- Category Name - One of the labels defined in the training data
- Highest Confidence - When several categories match, the one with the highest match score is returned
- "Not classified" - Returned when the document doesn't match any trained category well
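The selection rule above can be illustrated with a small sketch. The node itself only outputs the label; the per-category `scores` dictionary, the `min_score` threshold of 0.5, and the `pick_label` function are assumptions made for illustration and do not reflect how the engine computes or exposes scores.

```python
NOT_CLASSIFIED = "Not classified"


def pick_label(scores: dict[str, float], min_score: float = 0.5) -> str:
    """Return the best-scoring category, or "Not classified" below the threshold."""
    if not scores:
        return NOT_CLASSIFIED
    # Pick the category whose score is highest.
    best_label, best_score = max(scores.items(), key=lambda item: item[1])
    return best_label if best_score >= min_score else NOT_CLASSIFIED
```

With these assumed values, scores of 0.82 for "Invoice" and 0.11 for "Contract" give "Invoice", while a best score of 0.3 falls below the threshold and gives "Not classified".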
Tips and Best Practices
- Model Selection:
  - Use a model trained on similar document types
  - Retrain the model if classification accuracy is poor
  - Models are document-specific (don't reuse them across different domains)
  - Keep models updated with new document variations
- Orientation Correction:
  - Enable if documents may be rotated
  - Adds minimal processing time
  - Critical for scanned document batches
  - Not needed for born-digital documents
- Model Training:
  - Train with representative samples
  - Include variations of each category
  - More training samples generally give better accuracy
  - Test the model before production use
- Classification Accuracy:
  - Depends on model quality
  - Depends on how similar the document is to the training samples
  - Check confidence scores if available
  - Implement manual review for low-confidence results
- Performance:
  - Image-based: Very fast (< 1 second)
  - Text-based: Slower (OCR required, 3-5 seconds)
  - Combined: Medium speed (2-4 seconds)
  - Processing time depends on model type
- Error Handling:
  - Enable Continue On Error for batch processing
  - Handle "Not classified" results
  - Log classification results for monitoring
  - Flag low-confidence classifications for review
- Workflow Integration:
  - Classify before detailed processing
  - Use the label to route to the appropriate workflow
  - Combine with other nodes based on type
  - Track classification accuracy over time
- Batch Processing:
  - Classify multiple documents in a loop
  - Track success rates by category
  - Identify documents needing manual review
  - Update the model based on misclassifications
- Quality Assurance:
  - Spot-check classifications regularly
  - Monitor "Not classified" frequency
  - Review misclassified documents
  - Retrain the model when accuracy drops
  - Keep training data updated
- Common Issues:
  - Poor classification → Retrain the model with more samples
  - "Not classified" → Document type is likely missing from the training data
  - Wrong category → Add similar documents to the training set
  - Slow performance → Consider an image-based model
- Best Practices:
  - Start with a small category set (3-5 types)
  - Expand categories gradually
  - Test thoroughly before production
  - Monitor and log all classifications
  - Maintain model version control
  - Document category definitions clearly
  - Implement a feedback loop for improvements
- Document Preparation:
  - Consistent image quality helps
  - Standard resolution (300 DPI)
  - Clean scans without artifacts
  - Correct orientation before classification
- Model Management:
  - Store models in version control
  - Document model training parameters
  - Track model performance metrics
  - Plan for model updates
  - Test new models before deployment
- Advanced Usage:
  - Combine multiple classifiers (hierarchical)
  - Use confidence thresholds for routing
  - Implement multi-label classification
  - Track classification statistics
  - Build classification dashboards
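Several of the tips above (logging results, monitoring "Not classified" frequency, tracking statistics) come down to counting labels over a batch. A minimal sketch, assuming the labels returned by the node have been collected into a list; `summarize` is a hypothetical helper, not part of the node.

```python
from collections import Counter


def summarize(labels: list[str]) -> dict:
    """Summarize a batch of classification labels for monitoring."""
    counts = Counter(labels)
    total = len(labels)
    not_classified = counts.get("Not classified", 0)
    return {
        "total": total,
        "per_label": dict(counts),
        # Share of documents the model could not place in any category.
        "not_classified_rate": not_classified / total if total else 0.0,
    }
```

A rising `not_classified_rate` across batches is one possible signal that new document variations have appeared and the model may need retraining.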