Train Model

Trains a document classification model from labeled training data using ABBYY FineReader Engine.

Common Properties

Name - The custom name of the node.
Color - The custom color of the node.
Delay Before (sec) - Time to wait, in seconds, before executing the node.
Delay After (sec) - Time to wait, in seconds, after executing the node.
Continue On Error - Automation will continue regardless of any error. The default value is false.

Inputs

Training Data - Training dataset object containing labeled document samples with categories and file paths.
Model Path - Path where the trained model file will be saved (.mdc file).

Options

Language - Language for text recognition in documents (default: English). Required for text-based and combined classifiers.
Classifier Type - Type of classifier algorithm to use (default: Combined). Options:
- Image - Uses only visual features (fast, layout-based)
- Text - Uses only text content (slower, content-based)
- Combined - Uses both image and text features (recommended)
- ImageAndText - Advanced combined approach
Training Mode - Training mode that controls model complexity and accuracy (default: Normal). Options:
- Fast - Quick training, simpler model
- Normal - Balanced training time and accuracy
- FullSearch - Thorough training, best accuracy (slowest)
Correct Orientation - Whether to automatically detect and correct document orientation during training (default: false).
Allow Cross Validation - Whether to use cross-validation to evaluate model accuracy (default: true).
Folds Count - Number of folds to use for cross-validation (default: 3). Higher values give more accurate estimates but take longer.

Outputs

This node saves a trained classification model file to the specified Model Path. The model can be used with the Classify Document node.

How It Works

The Train Model node creates a machine learning classifier. When executed, the node:

1. Receives training data with labeled document samples
2. Configures processing parameters (language, orientation)
3. Adds all training documents to the trainer
4. Trains the model using the specified classifier type and mode
5. Optionally performs cross-validation to assess accuracy
6. Saves the trained model to the specified path

Requirements

Valid ABBYY FineReader Engine installation with Classification Engine
Valid ABBYY license
Training data object with labeled samples
At least 5-10 samples per category (more is better)
Output directory must exist and be writable

Training Data Format

The training data object should contain:

Categories - List of category names (labels)
Samples - List of sample documents with:
- File path to document
- Category label
- Optional metadata

Example structure:

{
  "categories": ["Invoice", "Contract", "Receipt"],
  "samples": [
    {"path": "C:/train/inv_001.pdf", "label": "Invoice"},
    {"path": "C:/train/inv_002.pdf", "label": "Invoice"},
    {"path": "C:/train/con_001.pdf", "label": "Contract"},
    ...
  ]
}
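A training data object in this shape can be assembled from a folder tree. The following is a minimal Python sketch, assuming a hypothetical layout in which each immediate subfolder is named after a category and holds that category's documents. The function name `build_training_data` and the list of file extensions are illustrative assumptions, not part of the node; only the `categories`, `samples`, `path`, and `label` keys come from the example structure.

```python
from pathlib import Path


def build_training_data(root, extensions=(".pdf", ".tif", ".tiff", ".png", ".jpg")):
    """Build a training data dict from a folder tree (hypothetical layout).

    Assumes each immediate subfolder of `root` is named after a category
    and contains that category's sample documents.
    """
    root = Path(root)
    categories = []
    samples = []
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(
            f for f in category_dir.iterdir()
            if f.is_file() and f.suffix.lower() in extensions
        )
        if not files:
            continue  # skip folders with no documents so every category has samples
        categories.append(category_dir.name)
        for f in files:
            # Forward slashes match the path style used in the example structure
            samples.append({"path": f.as_posix(), "label": category_dir.name})
    return {"categories": categories, "samples": samples}
```

The resulting dictionary can be serialized with `json.dumps` if the automation expects the object as JSON.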

Error Handling

The node will return errors if:

Training data is invalid or empty
Insufficient samples per category
Document files not found
Model cannot be saved to specified path

Usage Example

Scenario: Train an invoice vendor classifier

1. Prepare training data:
   - Collect 20+ invoices from each vendor
   - Label each invoice with vendor name
   - Create training data object

2. Train Model node:
   - Training Data: {{ $.training_data }}
   - Model Path: "C:/models/vendor_classifier.mdc"
   - Language: English
   - Classifier Type: Combined
   - Training Mode: Normal
   - Correct Orientation: true
   - Allow Cross Validation: true
   - Folds Count: 5

3. Review cross-validation results
4. Use model with Classify Document node

Scenario: Fast image-based form classifier

Train Model node:
- Training Data: {{ $.form_samples }}
- Model Path: "C:/models/form_types.mdc"
- Classifier Type: Image
- Training Mode: Fast
- Correct Orientation: true
- Allow Cross Validation: true
- Folds Count: 3

Common Use Cases

Document Type Classification - Train to identify invoices, contracts, receipts, etc.
Vendor Classification - Identify documents by vendor or supplier
Department Routing - Classify by destination department
Language Detection - Classify documents by language
Form Type Recognition - Identify different form types
Quality Classification - Classify by document quality or completeness
Compliance Categories - Classify by regulatory category
Content Categories - Classify by subject matter or topic

Classifier Types

Image-Based

Pros: Very fast classification, no OCR needed
Cons: Only works with visually distinct documents
Best for: Forms, templates, letterheads
Speed: < 1 second per document

Text-Based

Pros: Works with content differences
Cons: Slower (requires OCR), language-dependent
Best for: Letters, reports, articles
Speed: 3-5 seconds per document

Combined (Recommended)

Pros: Best accuracy, works with most documents
Cons: Medium speed, requires good training
Best for: Mixed document types, general use
Speed: 2-4 seconds per document

Training Modes

Fast

Quick training (minutes)
Simpler model
Good for initial testing
May have lower accuracy

Normal (Recommended)

Balanced training time
Good accuracy for most cases
10-30 minutes typical
Recommended for production

FullSearch

Thorough training
Best possible accuracy
May take hours
Use for critical applications

Cross-Validation

Cross-validation assesses model accuracy:

Splits training data into folds
Trains on some folds, tests on others
Repeats with different fold combinations
Reports average accuracy

Folds Count:

3 folds: Fast, less precise estimate
5 folds: Balanced (recommended)
10 folds: Slow, very precise estimate
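
The fold mechanics described above can be illustrated with a short Python sketch of plain k-fold splitting. This shows the general technique only; how the engine partitions samples internally is not documented here, and the function names `k_fold_splits` and `average_accuracy` are hypothetical.

```python
def k_fold_splits(samples, folds):
    """Yield (train, test) lists for each fold of a plain k-fold split."""
    if folds < 2 or folds > len(samples):
        raise ValueError("folds must be between 2 and the number of samples")
    for i in range(folds):
        # Fold i tests on every folds-th sample starting at offset i
        test = samples[i::folds]
        # and trains on all remaining samples
        train = [s for j, s in enumerate(samples) if j % folds != i]
        yield train, test


def average_accuracy(fold_accuracies):
    """Average the per-fold accuracies into a single estimate."""
    return sum(fold_accuracies) / len(fold_accuracies)
```

With 10 samples and 3 folds, the test sets hold 4, 3, and 3 samples, and every sample is tested exactly once. More folds mean more training runs, which is why higher Folds Count values take longer.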

Tips and Best Practices

Training Data:
- Minimum: 5-10 samples per category
- Recommended: 20-50 samples per category
- More samples = better accuracy
- Include variations within each category
- Balance sample counts across categories
- Use representative real-world documents
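
To check sample counts and balance before training, a small helper along these lines may be useful. It is a sketch that assumes the dictionary shape from the Training Data Format section; the names `sample_counts` and `imbalance_ratio` are hypothetical, and what ratio counts as "too imbalanced" is left to the reader.

```python
from collections import Counter


def sample_counts(data):
    """Count samples per category, including categories with no samples."""
    counts = Counter(s["label"] for s in data["samples"])
    return {c: counts.get(c, 0) for c in data["categories"]}


def imbalance_ratio(counts):
    """Largest category size divided by the smallest (1.0 means balanced)."""
    smallest = min(counts.values())
    if smallest == 0:
        return float("inf")  # at least one category has no samples at all
    return max(counts.values()) / smallest
```
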
Category Design:
- Start with 3-5 clear categories
- Categories should be distinct
- Avoid overlapping definitions
- Define clear category criteria
- Document category examples
Classifier Type:
- Use "Combined" unless you have specific needs
- Use "Image" for forms with fixed layouts
- Use "Text" for content-based classification
- Test different types if unsure
Training Mode:
- Start with "Normal" for most cases
- Use "Fast" for prototyping and testing
- Use "FullSearch" for critical production models
- Consider training time vs. accuracy needs
Language Setting:
- Match the language of your documents
- Required for text-based classifiers
- Not used for image-only classification
- Supports multilingual models
Orientation Correction:
- Enable if training documents vary in orientation
- Critical for scanned document batches
- Adds to training time
- Improves model robustness
Cross-Validation:
- Always enable for accuracy assessment
- Use results to validate model quality
- High accuracy (greater than 90%) = good model
- Low accuracy (less than 70%) = need more/better training data
- Use 3-5 folds for typical use
Model Quality:
- Review cross-validation accuracy
- Test on a separate validation set
- Aim for over 85% accuracy for production
- Retrain if accuracy is insufficient
- Update model periodically with new samples
Performance:
- Training is a one-time operation
- Can take minutes to hours
- Run during off-hours if needed
- Save multiple model versions
Model Management:
- Use descriptive model file names
- Include version numbers or dates
- Store in version control
- Document training parameters
- Keep training data archived
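
One way to follow these naming and documentation practices is sketched below in Python. The naming pattern, the helper names `versioned_model_path` and `write_training_record`, and the JSON sidecar file are assumptions for illustration, not conventions required by the node.

```python
import json
from datetime import date
from pathlib import Path


def versioned_model_path(models_dir, name, version, on=None):
    """Build a model path of the form <name>_v<version>_<YYYY-MM-DD>.mdc."""
    on = on or date.today()
    return Path(models_dir) / f"{name}_v{version}_{on.isoformat()}.mdc"


def write_training_record(model_path, params):
    """Write the training parameters to a JSON file next to the model."""
    record_path = Path(model_path).with_suffix(".json")
    record_path.write_text(
        json.dumps(params, indent=2, sort_keys=True), encoding="utf-8"
    )
    return record_path
```

The returned path can be passed to Model Path, and the record file keeps the classifier type, training mode, and fold count alongside the model for later comparison.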
Iterative Improvement:
- Start with a small training set
- Test and validate
- Add more samples where needed
- Focus on misclassified documents
- Retrain and test again
Common Issues:
- Low accuracy → Add more diverse training samples
- Imbalanced categories → Balance sample counts
- Long training time → Reduce samples or use Fast mode
- Model too large → Reduce samples or categories
Production Deployment:
- Train in a development environment
- Validate thoroughly before production
- Test on real-world documents
- Monitor classification accuracy
- Plan for model updates
- Keep training pipeline documented
Advanced Techniques:
- Ensemble multiple models
- Hierarchical classification (coarse → fine)
- Active learning (add misclassified to training)
- Regular model retraining schedule
- A/B testing of model versions
