Train Model

Trains a document classification model using labeled training data with ABBYY FineReader Engine.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Time to wait, in seconds, before the node executes.
  • Delay After (sec) - Time to wait, in seconds, after the node executes.
  • Continue On Error - If true, the automation continues even when this node fails. The default value is false.

Inputs

  • Training Data - Training dataset object containing labeled document samples with categories and file paths.
  • Model Path - Path where the trained model file will be saved (.mdc file).

Options

  • Language - Language for text recognition in documents (default: English). Required for text-based and combined classifiers.
  • Classifier Type - Type of classifier algorithm to use (default: Combined). Options:
    • Image - Uses only visual features (fast, layout-based)
    • Text - Uses only text content (slower, content-based)
    • Combined - Uses both image and text features (recommended)
    • ImageAndText - Advanced combined approach
  • Training Mode - Training mode that controls model complexity and accuracy (default: Normal). Options:
    • Fast - Quick training, simpler model
    • Normal - Balanced training time and accuracy
    • FullSearch - Thorough training, best accuracy (slowest)
  • Correct Orientation - Whether to automatically detect and correct document orientation during training (default: false).
  • Allow Cross Validation - Whether to use cross-validation to evaluate model accuracy (default: true).
  • Folds Count - Number of folds to use for cross-validation (default: 3). Higher values give more accurate estimates but take longer.

Outputs

This node saves a trained classification model file to the specified Model Path. The model can be used with the Classify Document node.

How It Works

The Train Model node creates a machine learning classifier. When executed, the node:

  1. Receives training data with labeled document samples
  2. Configures processing parameters (language, orientation)
  3. Adds all training documents to the trainer
  4. Trains the model using the specified classifier type and mode
  5. Optionally performs cross-validation to assess accuracy
  6. Saves the trained model to the specified path

Requirements

  • Valid ABBYY FineReader Engine installation with Classification Engine
  • Valid ABBYY license
  • Training data object with labeled samples
  • At least 5-10 samples per category (more is better)
  • Output directory must exist and be writable

Training Data Format

The training data object should contain:

  • Categories - List of category names (labels)
  • Samples - List of sample documents with:
    • File path to document
    • Category label
    • Optional metadata

Example structure:

{
  "categories": ["Invoice", "Contract", "Receipt"],
  "samples": [
    {"path": "C:/train/inv_001.pdf", "label": "Invoice"},
    {"path": "C:/train/inv_002.pdf", "label": "Invoice"},
    {"path": "C:/train/con_001.pdf", "label": "Contract"},
    ...
  ]
}
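
If the samples are already sorted into one folder per category, the training data object can be generated rather than written by hand. The following is a minimal sketch, not part of the node itself. The folder-per-category layout, the function name `build_training_data`, and the list of file extensions are assumptions made for illustration. Only the `categories`, `samples`, `path`, and `label` keys come from the structure shown above.

```python
from pathlib import Path


def build_training_data(root, extensions=(".pdf", ".tif", ".tiff", ".png", ".jpg")):
    """Build a training data object from an assumed layout: <root>/<category>/<files>."""
    categories = []
    samples = []
    # Each sub-folder name is treated as a category label (an assumption).
    for category_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        files = sorted(
            f for f in category_dir.iterdir()
            if f.is_file() and f.suffix.lower() in extensions
        )
        if not files:
            continue  # skip folders without usable documents
        categories.append(category_dir.name)
        for f in files:
            samples.append({"path": f.as_posix(), "label": category_dir.name})
    return {"categories": categories, "samples": samples}
```

The returned dictionary could then be passed to the Training Data input, for example through a flow variable such as `{{ $.training_data }}`.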

Error Handling

The node will return errors if:

  • The training data is invalid or empty
  • A category has too few samples
  • A document file cannot be found
  • The model cannot be saved to the specified path
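
Most of these conditions can be checked before the node runs, which avoids starting a long training job that is bound to fail. The sketch below is a hypothetical pre-flight check, not the node's own validation. The function name `validate_training_data` and the `min_samples` default of 5 (taken from the lower bound under Requirements) are assumptions.

```python
import os
from collections import Counter


def validate_training_data(training_data, model_path, min_samples=5):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    categories = training_data.get("categories") or []
    samples = training_data.get("samples") or []

    if not categories or not samples:
        problems.append("training data is empty")

    # Count samples per label and flag categories below the minimum.
    counts = Counter(sample.get("label") for sample in samples)
    for category in categories:
        if counts[category] < min_samples:
            problems.append(f"category {category!r} has only {counts[category]} sample(s)")

    # Every sample needs a known label and an existing file.
    for sample in samples:
        if sample.get("label") not in categories:
            problems.append(f"unknown label {sample.get('label')!r}")
        path = sample.get("path", "")
        if not os.path.isfile(path):
            problems.append(f"file not found: {path}")

    # The output directory must already exist and be writable.
    output_dir = os.path.dirname(model_path) or "."
    if not (os.path.isdir(output_dir) and os.access(output_dir, os.W_OK)):
        problems.append(f"output directory missing or not writable: {output_dir}")

    return problems
```

Collecting every problem in one pass, rather than stopping at the first, makes it easier to fix a training set in a single round.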

Usage Example

Scenario: Train an invoice vendor classifier

1. Prepare training data:
- Collect 20+ invoices from each vendor
- Label each invoice with vendor name
- Create training data object

2. Train Model node:
- Training Data: {{ $.training_data }}
- Model Path: "C:/models/vendor_classifier.mdc"
- Language: English
- Classifier Type: Combined
- Training Mode: Normal
- Correct Orientation: true
- Allow Cross Validation: true
- Folds Count: 5

3. Review cross-validation results
4. Use model with Classify Document node

Scenario: Fast image-based form classifier

Train Model node:
- Training Data: {{ $.form_samples }}
- Model Path: "C:/models/form_types.mdc"
- Classifier Type: Image
- Training Mode: Fast
- Correct Orientation: true
- Allow Cross Validation: true
- Folds Count: 3

Common Use Cases

  • Document Type Classification - Train to identify invoices, contracts, receipts, etc.
  • Vendor Classification - Identify documents by vendor or supplier
  • Department Routing - Classify by destination department
  • Language Detection - Classify documents by language
  • Form Type Recognition - Identify different form types
  • Quality Classification - Classify by document quality or completeness
  • Compliance Categories - Classify by regulatory category
  • Content Categories - Classify by subject matter or topic

Classifier Types

Image-Based

  • Pros: Very fast classification, no OCR needed
  • Cons: Only works with visually distinct documents
  • Best for: Forms, templates, letterheads
  • Speed: < 1 second per document

Text-Based

  • Pros: Works with content differences
  • Cons: Slower (requires OCR), language-dependent
  • Best for: Letters, reports, articles
  • Speed: 3-5 seconds per document

Combined

  • Pros: Best accuracy, works with most documents
  • Cons: Medium speed, requires good training
  • Best for: Mixed document types, general use
  • Speed: 2-4 seconds per document

Training Modes

Fast

  • Quick training (minutes)
  • Simpler model
  • Good for initial testing
  • May have lower accuracy

Normal

  • Balanced training time
  • Good accuracy for most cases
  • 10-30 minutes typical
  • Recommended for production

FullSearch

  • Thorough training
  • Best possible accuracy
  • May take hours
  • Use for critical applications

Cross-Validation

Cross-validation assesses model accuracy:

  • Splits training data into folds
  • Trains on some folds, tests on others
  • Repeats with different fold combinations
  • Reports average accuracy

Folds Count:

  • 3 folds: Fast, less precise estimate
  • 5 folds: Balanced (recommended)
  • 10 folds: Slow, very precise estimate
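
To make the fold mechanics concrete, here is a generic k-fold sketch in plain Python. It illustrates the general technique only. How the engine actually partitions samples is not documented here, and the names `k_fold_splits` and `average_accuracy` are made up for this example.

```python
import random


def k_fold_splits(samples, folds=3, seed=0):
    """Yield (train, test) pairs so that every sample is tested exactly once."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Deal samples round-robin so fold sizes differ by at most one.
    buckets = [shuffled[i::folds] for i in range(folds)]
    for i in range(folds):
        test = buckets[i]
        train = [s for j, bucket in enumerate(buckets) if j != i for s in bucket]
        yield train, test


def average_accuracy(per_fold_accuracy):
    """Average the accuracy measured on each held-out fold."""
    return sum(per_fold_accuracy) / len(per_fold_accuracy)
```

With more folds, each round trains on a larger share of the data and the average is taken over more rounds, which is why the estimate improves while the total run time grows.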

Tips and Best Practices

  • Training Data:
    • Minimum: 5-10 samples per category
    • Recommended: 20-50 samples per category
    • More samples = better accuracy
    • Include variations within each category
    • Balance sample counts across categories
    • Use representative real-world documents
  • Category Design:
    • Start with 3-5 clear categories
    • Categories should be distinct
    • Avoid overlapping definitions
    • Define clear category criteria
    • Document category examples
  • Classifier Type:
    • Use "Combined" unless you have specific needs
    • Use "Image" for forms with fixed layouts
    • Use "Text" for content-based classification
    • Test different types if unsure
  • Training Mode:
    • Start with "Normal" for most cases
    • Use "Fast" for prototyping and testing
    • Use "FullSearch" for critical production models
    • Consider training time vs. accuracy needs
  • Language Setting:
    • Match the language of your documents
    • Required for text-based classifiers
    • Not used for image-only classification
    • Supports multilingual models
  • Orientation Correction:
    • Enable if training documents vary in orientation
    • Critical for scanned document batches
    • Adds to training time
    • Improves model robustness
  • Cross-Validation:
    • Always enable for accuracy assessment
    • Use results to validate model quality
    • High accuracy (greater than 90%) = good model
    • Low accuracy (less than 70%) = need more/better training data
    • Use 3-5 folds for typical use
  • Model Quality:
    • Review cross-validation accuracy
    • Test on separate validation set
    • Aim for over 85% accuracy for production
    • Retrain if accuracy is insufficient
    • Update model periodically with new samples
  • Performance:
    • Training is a one-time operation for each model version
    • Can take minutes to hours
    • Run during off-hours if needed
    • Save multiple model versions
  • Model Management:
    • Use descriptive model file names
    • Include version numbers or dates
    • Store in version control
    • Document training parameters
    • Keep training data archived
  • Iterative Improvement:
    • Start with small training set
    • Test and validate
    • Add more samples where needed
    • Focus on misclassified documents
    • Retrain and test again
  • Common Issues:
    • Low accuracy → Add more diverse training samples
    • Imbalanced categories → Balance sample counts
    • Long training time → Reduce samples or use Fast mode
    • Model too large → Reduce samples or categories
  • Production Deployment:
    • Train in a development environment
    • Validate thoroughly before production
    • Test on real-world documents
    • Monitor classification accuracy
    • Plan for model updates
    • Keep training pipeline documented
  • Advanced Techniques:
    • Ensemble multiple models
    • Hierarchical classification (coarse → fine)
    • Active learning (add misclassified to training)
    • Regular model retraining schedule
    • A/B testing of model versions
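
The accuracy thresholds mentioned in the tips above can be turned into a simple gate in a training pipeline. This is a hypothetical helper, assuming the cross-validation accuracy is available as a fraction between 0 and 1. How the node reports that value is not specified here, and the name `assess_accuracy` is illustrative.

```python
def assess_accuracy(accuracy):
    """Map a cross-validation accuracy (0.0 to 1.0) to the guidance in the tips."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if accuracy < 0.70:
        return "low: add more or better training data"
    if accuracy <= 0.85:
        return "below production target: keep improving the training set"
    if accuracy <= 0.90:
        return "meets production target"
    return "good model"
```

A flow could call this after training and stop before deployment when the result is not one of the last two outcomes.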