Train Model
Trains a document classification model using labeled training data with ABBYY FineReader Engine.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Time to wait, in seconds, before executing the node.
- Delay After (sec) - Time to wait, in seconds, after executing the node.
- Continue On Error - If true, the automation continues even if this node returns an error. The default value is false.
Inputs
- Training Data - Training dataset object containing labeled document samples with categories and file paths.
- Model Path - Path where the trained model file will be saved (.mdc file).
Options
- Language - Language for text recognition in documents (default: English). Required for text-based and combined classifiers.
- Classifier Type - Type of classifier algorithm to use (default: Combined). Options:
  - Image - Uses only visual features (fast, layout-based)
  - Text - Uses only text content (slower, content-based)
  - Combined - Uses both image and text features (recommended)
  - ImageAndText - Advanced combined approach
- Training Mode - Training mode that controls model complexity and accuracy (default: Normal). Options:
  - Fast - Quick training, simpler model
  - Normal - Balanced training time and accuracy
  - FullSearch - Thorough training, best accuracy (slowest)
- Correct Orientation - Whether to automatically detect and correct document orientation during training (default: false).
- Allow Cross Validation - Whether to use cross-validation to evaluate model accuracy (default: true).
- Folds Count - Number of folds to use for cross-validation (default: 3). Higher values give more accurate estimates but take longer.
Outputs
This node saves a trained classification model file to the specified Model Path. The model can be used with the Classify Document node.
How It Works
The Train Model node creates a machine learning classifier. When executed, the node:
- Receives training data with labeled document samples
- Configures processing parameters (language, orientation)
- Adds all training documents to the trainer
- Trains the model using the specified classifier type and mode
- Optionally performs cross-validation to assess accuracy
- Saves the trained model to the specified path
Requirements
- Valid ABBYY FineReader Engine installation with Classification Engine
- Valid ABBYY license
- Training data object with labeled samples
- At least 5-10 samples per category (more is better)
- Output directory must exist and be writable
Training Data Format
The training data object should contain:
- Categories - List of category names (labels)
- Samples - List of sample documents, each with:
  - File path to document
  - Category label
  - Optional metadata
Example structure:
{
"categories": ["Invoice", "Contract", "Receipt"],
"samples": [
{"path": "C:/train/inv_001.pdf", "label": "Invoice"},
{"path": "C:/train/inv_002.pdf", "label": "Invoice"},
{"path": "C:/train/con_001.pdf", "label": "Contract"},
...
]
}
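A training data object in the shape shown above can be assembled from a folder tree. The following is a minimal sketch, assuming one subfolder per category (for example `C:/train/Invoice/`) and a hypothetical set of accepted file extensions; the helper names `build_training_data` and `samples_per_category` are illustrative and not part of the node. The keys `categories`, `samples`, `path`, and `label` follow the example structure.

```python
from collections import Counter
from pathlib import Path

# Assumption: these extensions are illustrative, not the engine's official list.
DOC_EXTENSIONS = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def build_training_data(root):
    """Build {"categories": [...], "samples": [...]} from a folder tree.

    Assumes each immediate subfolder of `root` is named after a category
    and contains the sample documents for that category.
    """
    root = Path(root)
    categories = []
    samples = []
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(
            f for f in category_dir.iterdir()
            if f.is_file() and f.suffix.lower() in DOC_EXTENSIONS
        )
        if not files:
            continue  # skip empty folders instead of emitting an empty category
        categories.append(category_dir.name)
        for f in files:
            samples.append({"path": f.as_posix(), "label": category_dir.name})
    return {"categories": categories, "samples": samples}


def samples_per_category(training_data):
    """Count samples per label, useful for spotting imbalanced categories."""
    return Counter(s["label"] for s in training_data["samples"])
```

The counts returned by `samples_per_category` give a quick view of whether each category has enough samples before the object is passed to the node.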
Error Handling
The node will return errors if:
- Training data is invalid or empty
- Insufficient samples per category
- Document files not found
- Model cannot be saved to specified path
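The error conditions above can be checked before the node runs, which avoids starting a long training job that fails late. This is a sketch only: `preflight` is a hypothetical helper, the data keys follow the example structure in Training Data Format, and the minimum of 5 samples is taken from the lower bound of the guideline in Requirements rather than from a documented engine limit.

```python
import os
from collections import Counter
from pathlib import Path

# Assumption: lower bound of the "5-10 samples per category" guideline.
MIN_SAMPLES_PER_CATEGORY = 5


def preflight(training_data, model_path, min_samples=MIN_SAMPLES_PER_CATEGORY):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    data = training_data or {}
    samples = data.get("samples") or []
    categories = data.get("categories") or []

    if not samples or not categories:
        problems.append("training data is empty or missing categories/samples")

    counts = Counter(s.get("label") for s in samples)
    for category in categories:
        if counts[category] < min_samples:
            problems.append(
                f"category {category!r} has {counts[category]} samples, "
                f"need at least {min_samples}"
            )
    for label in counts:
        if label not in categories:
            problems.append(f"label {label!r} is not listed in categories")

    for s in samples:
        path = s.get("path")
        if not path or not Path(path).is_file():
            problems.append(f"document not found: {path}")

    out_dir = Path(model_path).parent
    if not out_dir.is_dir():
        problems.append(f"output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        # os.access is a best-effort check; it can be unreliable on Windows.
        problems.append(f"output directory is not writable: {out_dir}")
    return problems
```

An empty result does not guarantee that training succeeds; it only rules out the failures that can be detected from the file system and the data object.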
Usage Example
Scenario: Train an invoice vendor classifier
1. Prepare training data:
   - Collect 20+ invoices from each vendor
   - Label each invoice with vendor name
   - Create training data object
2. Train Model node:
   - Training Data: {{ $.training_data }}
   - Model Path: "C:/models/vendor_classifier.mdc"
   - Language: English
   - Classifier Type: Combined
   - Training Mode: Normal
   - Correct Orientation: true
   - Allow Cross Validation: true
   - Folds Count: 5
3. Review cross-validation results
4. Use model with Classify Document node
Scenario: Fast image-based form classifier
Train Model node:
- Training Data: {{ $.form_samples }}
- Model Path: "C:/models/form_types.mdc"
- Classifier Type: Image
- Training Mode: Fast
- Correct Orientation: true
- Allow Cross Validation: true
- Folds Count: 3
Common Use Cases
- Document Type Classification - Train to identify invoices, contracts, receipts, etc.
- Vendor Classification - Identify documents by vendor or supplier
- Department Routing - Classify by destination department
- Language Detection - Classify documents by language
- Form Type Recognition - Identify different form types
- Quality Classification - Classify by document quality or completeness
- Compliance Categories - Classify by regulatory category
- Content Categories - Classify by subject matter or topic
Classifier Types
Image-Based
- Pros: Very fast classification, no OCR needed
- Cons: Only works with visually distinct documents
- Best for: Forms, templates, letterheads
- Speed: < 1 second per document
Text-Based
- Pros: Works with content differences
- Cons: Slower (requires OCR), language-dependent
- Best for: Letters, reports, articles
- Speed: 3-5 seconds per document
Combined (Recommended)
- Pros: Best accuracy, works with most documents
- Cons: Medium speed, requires good training
- Best for: Mixed document types, general use
- Speed: 2-4 seconds per document
Training Modes
Fast
- Quick training (minutes)
- Simpler model
- Good for initial testing
- May have lower accuracy
Normal (Recommended)
- Balanced training time
- Good accuracy for most cases
- 10-30 minutes typical
- Recommended for production
FullSearch
- Thorough training
- Best possible accuracy
- May take hours
- Use for critical applications
Cross-Validation
Cross-validation estimates how accurately the model classifies documents it was not trained on:
- Splits training data into folds
- Trains on some folds, tests on others
- Repeats with different fold combinations
- Reports average accuracy
Folds Count:
- 3 folds: Fast, less precise estimate
- 5 folds: Balanced (recommended)
- 10 folds: Slow, very precise estimate
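The fold procedure described above can be illustrated with a generic k-fold split. This sketch shows the general technique only; how the engine partitions data internally is not documented here, and keeping the label ratio similar in each fold (stratification) is an assumption of this example. The function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, folds_count=3, seed=0):
    """Assign each sample index to one of `folds_count` folds.

    Indices are dealt out per label in round-robin order, so every fold
    receives a similar share of each category.
    """
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(folds_count)]
    for indices in by_label.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % folds_count].append(index)
    return folds


def cross_validation_rounds(labels, folds_count=3, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, folds_count, seed)
    for held_out in range(folds_count):
        test = sorted(folds[held_out])
        train = sorted(
            i for k, fold in enumerate(folds) if k != held_out for i in fold
        )
        yield train, test
```

Each round trains on all folds except one and tests on the held-out fold; the reported figure is the average of the per-round accuracies. With more folds, each round trains on a larger share of the data but the whole procedure runs more training passes, which is why higher fold counts take longer.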
Tips and Best Practices
- Training Data:
- Minimum: 5-10 samples per category
- Recommended: 20-50 samples per category
- More samples generally improve accuracy
- Include variations within each category
- Balance sample counts across categories
- Use representative real-world documents
- Category Design:
- Start with 3-5 clear categories
- Categories should be distinct
- Avoid overlapping definitions
- Define clear category criteria
- Document category examples
- Classifier Type:
- Use "Combined" unless you have specific needs
- Use "Image" for forms with fixed layouts
- Use "Text" for content-based classification
- Test different types if unsure
- Training Mode:
- Start with "Normal" for most cases
- Use "Fast" for prototyping and testing
- Use "FullSearch" for critical production models
- Consider training time vs. accuracy needs
- Language Setting:
- Match the language of your documents
- Required for text-based classifiers
- Not used for image-only classification
- Supports multilingual models
- Orientation Correction:
- Enable if training documents vary in orientation
- Critical for scanned document batches
- Adds to training time
- Improves model robustness
- Cross-Validation:
- Always enable for accuracy assessment
- Use results to validate model quality
- Accuracy above 90% indicates a good model
- Accuracy below 70% indicates that more or better training data is needed
- Use 3-5 folds for typical use
- Model Quality:
- Review cross-validation accuracy
- Test on a separate validation set
- Aim for over 85% accuracy for production
- Retrain if accuracy is insufficient
- Update model periodically with new samples
- Performance:
- Training is a one-time operation for each model version
- Can take minutes to hours
- Run during off-hours if needed
- Save multiple model versions
- Model Management:
- Use descriptive model file names
- Include version numbers or dates
- Store in version control
- Document training parameters
- Keep training data archived
- Iterative Improvement:
- Start with a small training set
- Test and validate
- Add more samples where needed
- Focus on misclassified documents
- Retrain and test again
- Common Issues:
- Low accuracy → Add more diverse training samples
- Imbalanced categories → Balance sample counts
- Long training time → Reduce samples or use Fast mode
- Model too large → Reduce samples or categories
- Production Deployment:
- Train in a development environment
- Validate thoroughly before production
- Test on real-world documents
- Monitor classification accuracy
- Plan for model updates
- Keep training pipeline documented
- Advanced Techniques:
- Ensemble multiple models
- Hierarchical classification (coarse → fine)
- Active learning (add misclassified documents to the training set)
- Regular model retraining schedule
- A/B testing of model versions
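The Model Management tips (descriptive file names, version numbers or dates, documented training parameters) can be supported with a small naming helper. This is a sketch under assumptions: the file name pattern and the JSON sidecar file are conventions invented for this example, not something the node produces or requires, and both function names are hypothetical.

```python
import json
from datetime import date
from pathlib import Path


def versioned_model_path(models_dir, name, version, day=None):
    """Return a path such as <models_dir>/<name>_v<version>_<YYYYMMDD>.mdc."""
    day = day or date.today()
    return Path(models_dir) / f"{name}_v{version}_{day:%Y%m%d}.mdc"


def write_training_record(model_path, params, sample_counts):
    """Write a JSON file next to the model describing how it was trained."""
    record_path = Path(model_path).with_suffix(".json")
    record = {
        "model": Path(model_path).name,
        "params": params,                # node options used for this run
        "sample_counts": sample_counts,  # samples per category
    }
    record_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record_path
```

The path returned by `versioned_model_path` can be used as the node's Model Path, and the sidecar file keeps the options and sample counts together with the model so that a later retraining run can be compared against it.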