Extract Text
Extracts text content from documents using Google Document AI's OCR and text recognition capabilities, converting PDFs, images, and scanned documents into searchable, machine-readable text.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- File Path - The local file path of the document to process. Supports PDF, PNG, JPG, JPEG, TIFF, GIF, BMP, and WEBP formats.
- MIME Type - The MIME type of the document (e.g., application/pdf, image/png, image/jpeg). If left empty, the MIME type will be automatically detected from the file content.
Output
- Text - The complete extracted text from all pages concatenated together as a single string. Preserves the reading order of the original document.
- Pages - An array of page objects, where each object contains:
  - page (number): The page number (starting from 1)
  - text (string): The extracted text content for that specific page
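To make the shape of the two outputs concrete, here is a small sketch built on sample values. The sample text is invented for illustration, and `findPagesContaining` is a hypothetical helper, not part of the node.

```javascript
// Sample values shaped like the node's outputs (content is invented).
const text = "Invoice #: A1001\nThank you for your business.";
const pages = [
  { page: 1, text: "Invoice #: A1001\n" },
  { page: 2, text: "Thank you for your business." },
];

// Hypothetical helper: return the page numbers whose text contains a keyword
// (case-insensitive).
function findPagesContaining(pageList, keyword) {
  const needle = keyword.toLowerCase();
  return pageList
    .filter((p) => p.text.toLowerCase().includes(needle))
    .map((p) => p.page);
}

console.log("Characters in full text:", text.length);
console.log("Pages mentioning 'invoice':", findPagesContaining(pages, "invoice"));
```

Working from the Pages array rather than the full Text is useful when the page number itself matters, for example to route a specific page to a reviewer.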
Options
- Credentials - Service Account credentials for Document AI API authentication. Select from Robomotion vault or provide the JSON key file content.
- Project Id - The Google Cloud project ID where Document AI is enabled (e.g., "my-project-123").
- Location - The processor location/region. Default is "us". Available options: "us", "eu", "asia". Choose the region where your processor is deployed.
- Processor Id - The Document AI processor ID to use for text extraction (e.g., "a1b2c3d4e5f6g7h8"). Found in Google Cloud Console under Document AI processors.
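The Project Id, Location, and Processor Id together identify one processor. The sketch below shows how the public Document AI REST API (v1) addresses a processor from these three values. That the node composes them the same way internally is an assumption, and the function names are illustrative only.

```javascript
// Build the fully qualified processor resource name used by Document AI.
function processorName(projectId, location, processorId) {
  return `projects/${projectId}/locations/${location}/processors/${processorId}`;
}

// Build the regional "process" URL for that processor (v1 REST API).
// The host is prefixed with the location, which is why the Location
// option has to match the region where the processor was created.
function processUrl(projectId, location, processorId) {
  const host = `${location}-documentai.googleapis.com`;
  return `https://${host}/v1/${processorName(projectId, location, processorId)}:process`;
}

console.log(processUrl("my-project-123", "us", "a1b2c3d4e5f6g7h8"));
```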
How It Works
The Extract Text node integrates with Google Document AI to extract text from documents. When executed, the node:
- Validates the provided file path and checks file accessibility
- Detects MIME type automatically if not specified
- Authenticates with Google Document AI using the provided Service Account credentials
- Reads the document file content from the local file system
- Sends the document to the specified Document AI processor
- Processes the document using OCR and machine learning models
- Extracts text while preserving the reading order and layout structure
- Organizes results by page and returns both full text and page-level text
- Sets output variables for use in subsequent flow nodes
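The step that organizes results by page can be illustrated with a sketch. A Document AI response carries the full text once, and each page points into it through text segments with start and end offsets. The code below rebuilds page-level text from a response of that shape. It is not the node's actual implementation, `pagesFromDocument` is a hypothetical name, and it assumes the offsets count characters, as Google's own samples treat them.

```javascript
// Rebuild per-page text from a Document AI style "document" object.
// Each page's layout.textAnchor.textSegments holds [startIndex, endIndex)
// offsets into document.text. The JSON API encodes them as strings and may
// omit startIndex when it is 0.
function pagesFromDocument(document) {
  // Index by code point so characters outside the BMP count as one each
  // (assumption: offsets are character offsets, not UTF-16 units).
  const chars = Array.from(document.text || "");
  return (document.pages || []).map((page, i) => {
    const anchor = (page.layout && page.layout.textAnchor) || {};
    const segments = anchor.textSegments || [];
    const text = segments
      .map((s) => chars.slice(Number(s.startIndex || 0), Number(s.endIndex)).join(""))
      .join("");
    return { page: page.pageNumber || i + 1, text };
  });
}

// Invented sample response with two pages.
const sample = {
  text: "Hello\nWorld\n",
  pages: [
    { pageNumber: 1, layout: { textAnchor: { textSegments: [{ endIndex: "6" }] } } },
    { pageNumber: 2, layout: { textAnchor: { textSegments: [{ startIndex: "6", endIndex: "12" }] } } },
  ],
};
console.log(pagesFromDocument(sample));
```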
Requirements
- Google Cloud Setup:
  - Active Google Cloud project with billing enabled
  - Document AI API enabled in the project
  - OCR processor created and deployed in Document AI
  - Service Account with the "Document AI API User" role
- Robomotion Setup:
  - Service Account JSON key stored in Robomotion vault
  - Document file accessible on the local file system
  - File size under the 20 MB limit
Practical Examples
Example 1: Extract Text from Scanned Invoice
// Extract text from a scanned invoice PDF
// Outputs: Full text in $text, pages array in $pages
// Access full text
const fullText = $text;
console.log("Extracted Text:", fullText);
// Search for invoice number in text
const invoiceMatch = fullText.match(/Invoice #:\s*(\w+)/);
if (invoiceMatch) {
  const invoiceNumber = invoiceMatch[1];
  console.log("Invoice Number:", invoiceNumber);
}
// Process each page separately
$pages.forEach(page => {
  console.log(`Page ${page.page}:`, page.text);
});
Example 2: Convert Multi-Page PDF to Text Files
// Extract text and save each page to separate file
const fs = require('fs');
$pages.forEach(page => {
  const fileName = `page_${page.page}.txt`;
  fs.writeFileSync(fileName, page.text);
  console.log(`Saved ${fileName}`);
});
// Also save the full document
fs.writeFileSync('full_document.txt', $text);
Example 3: Search and Validate Document Content
// Extract and validate required fields from a form
const requiredFields = ['Name:', 'Date:', 'Signature:'];
const missingFields = [];
requiredFields.forEach(field => {
  if (!$text.includes(field)) {
    missingFields.push(field);
  }
});
if (missingFields.length > 0) {
  throw new Error(`Missing fields: ${missingFields.join(', ')}`);
}
console.log("All required fields present");
Example 4: Extract Text from Image and Process
// Extract text from image document (e.g., photographed receipt)
// File Path: /downloads/receipt.jpg
// MIME Type: (auto-detected)
// Clean and process extracted text
const cleanText = $text
  .replace(/\n+/g, ' ') // Replace newlines with spaces
  .replace(/\s+/g, ' ') // Normalize whitespace
  .trim();
// Extract total amount using regex (\b keeps "Subtotal" from matching)
const totalMatch = cleanText.match(/\bTotal:?\s*\$?([\d,]+\.?\d*)/i);
if (totalMatch) {
  // Remove every thousands separator, not just the first one
  const total = parseFloat(totalMatch[1].replace(/,/g, ''));
  console.log("Total Amount:", total);
}
Tips for Effective Use
Document Preparation
- Image Quality: Use high-resolution scans (300 DPI or higher) for best OCR accuracy
- File Format: PDF format generally provides better results than images
- Orientation: Ensure documents are properly oriented (not rotated or upside down)
- Lighting: For photographed documents, use even lighting without shadows or glare
MIME Type Handling
- Auto-detection works well for most cases
- Explicitly specify MIME type for better performance when processing many files
- Common MIME types:
  - PDF: application/pdf
  - PNG: image/png
  - JPEG: image/jpeg
  - TIFF: image/tiff
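When many files are processed in a loop, the MIME Type input can be filled from the file extension instead of relying on auto-detection. The following is a minimal sketch under that assumption. `mimeTypeFor` and the lookup table are illustrative names, and the `.tif` entry is included only as a common alternative extension for TIFF files.

```javascript
const path = require("path");

// Extension to MIME type lookup for the formats listed under File Path.
const MIME_BY_EXTENSION = {
  ".pdf": "application/pdf",
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".tif": "image/tiff",
  ".tiff": "image/tiff",
  ".gif": "image/gif",
  ".bmp": "image/bmp",
  ".webp": "image/webp",
};

// Return the MIME type for a path, or an empty string for unknown
// extensions so that the node can fall back to auto-detection.
function mimeTypeFor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return MIME_BY_EXTENSION[ext] || "";
}

console.log(mimeTypeFor("/downloads/receipt.JPG"));
```

Returning an empty string for unknown extensions keeps the behavior described for the MIME Type input: an empty value means the type is detected from the file content.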
Processing Large Documents
- Documents with more than 15 pages may require splitting
- Consider processing page ranges separately for very large documents
- Monitor processing time (default timeout is 240 seconds)
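Planning the page ranges for a large document is simple arithmetic, sketched below. `pageRanges` is a hypothetical helper. It only computes the ranges; actually splitting the PDF into separate files would need a separate tool, since this document does not describe a page-range input on the node.

```javascript
// Split a page count into consecutive 1-based ranges of at most
// `maxPerRequest` pages (15 by default, matching the tip above).
function pageRanges(totalPages, maxPerRequest = 15) {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(maxPerRequest) || maxPerRequest < 1) {
    throw new RangeError("maxPerRequest must be a positive integer");
  }
  const ranges = [];
  for (let start = 1; start <= totalPages; start += maxPerRequest) {
    ranges.push({ start, end: Math.min(start + maxPerRequest - 1, totalPages) });
  }
  return ranges;
}

console.log(pageRanges(40));
```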
Text Cleanup
- Extracted text may contain extra whitespace or line breaks
- Use string manipulation to normalize formatting
- Consider using regular expressions for pattern matching
- Handle special characters and encoding properly
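One possible cleanup routine is sketched below. `normalizeExtractedText` is a hypothetical helper, and the specific rules (keeping at most one blank line, treating non-breaking spaces as spaces) are choices made for this example rather than behavior of the node.

```javascript
// Normalize OCR output while keeping paragraph breaks intact.
function normalizeExtractedText(raw) {
  return raw
    .normalize("NFC")               // compose accented characters consistently
    .replace(/\r\n?/g, "\n")        // unify line endings
    .replace(/[ \t\u00A0]+/g, " ")  // collapse runs of spaces, tabs, and nbsp
    .replace(/ *\n */g, "\n")       // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n")     // keep at most one blank line
    .trim();
}

console.log(normalizeExtractedText("  Total:\t 42.00 \r\n\r\n\r\n\r\nThank\u00A0 you  "));
```

Unlike flattening everything to a single line, this keeps line and paragraph boundaries, which helps when later pattern matching depends on a label and its value sitting on the same line.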
Error Prevention
- Verify File Paths: Always check that the file exists before processing
- Check File Size: Keep documents under the 20 MB limit
- Validate Credentials: Test credentials with a simple document first
- Region Matching: Ensure the Location option matches the region where the processor was created
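The file-related checks can be run in a script before the node executes. The sketch below is one way to do that. `preflight` is a hypothetical helper, and interpreting the 20 MB limit as 20 * 1024 * 1024 bytes is an assumption, since this document does not say how the limit is measured.

```javascript
const fs = require("fs");
const path = require("path");

const MAX_BYTES = 20 * 1024 * 1024; // assumed byte value of the 20 MB limit
const SUPPORTED = new Set([
  ".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".gif", ".bmp", ".webp",
]);

// Return a list of problems found for the given path (empty when none).
function preflight(filePath) {
  if (!filePath) return ["File path is empty"];
  const problems = [];
  if (!path.isAbsolute(filePath)) problems.push("Path is not absolute");
  if (!SUPPORTED.has(path.extname(filePath).toLowerCase())) {
    problems.push("Unsupported file extension");
  }
  try {
    const stats = fs.statSync(filePath);
    if (!stats.isFile()) problems.push("Path is not a regular file");
    else if (stats.size > MAX_BYTES) problems.push("File exceeds 20 MB");
  } catch (err) {
    problems.push(`Cannot access file: ${err.code || err.message}`);
  }
  return problems;
}

console.log(preflight(""));
```

An empty result means none of these checks failed. It does not guarantee that processing will succeed, because credentials and processor settings are validated separately.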
Common Errors and Solutions
Error: "File path cannot be empty"
Cause: No file path provided to the node.
Solution: Ensure the File Path input is populated with a valid path.
Error: "Failed to read document file"
Cause: File not found, permission denied, or path is incorrect.
Solution:
- Verify the file exists at the specified path
- Check file permissions allow reading
- Use absolute paths instead of relative paths
- Ensure the file hasn't been moved or deleted
Error: "Project ID cannot be empty"
Cause: Project ID option is not configured.
Solution: Add your Google Cloud project ID in the node options.
Error: "Processor ID cannot be empty"
Cause: Processor ID option is not configured.
Solution:
- Create a processor in Google Cloud Console
- Copy the processor ID from the processor details page
- Add it to the node options
Error: "Invalid credentials format: missing content field"
Cause: Credentials are not properly formatted or stored.
Solution:
- Re-download Service Account JSON key from Google Cloud Console
- Save the complete JSON content in Robomotion vault
- Select the correct credential from the dropdown
Error: "Failed to process document: permission denied"
Cause: Service Account lacks necessary permissions.
Solution:
- Ensure Service Account has "Document AI API User" role
- Verify Document AI API is enabled in the project
- Check that the processor ID belongs to the specified project
Error: "Request timeout"
Cause: Document processing took longer than the timeout period.
Solution:
- Reduce document size or page count
- Check network connectivity
- Try processing during off-peak hours
- Contact support if timeout persists with normal documents
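In a flow that handles failures in a script, it can help to separate errors worth retrying from those that need a configuration fix. The sketch below groups the messages listed in this section. The category names and the `classifyError` and `shouldRetry` helpers are hypothetical, and matching on message text assumes the messages reach the script exactly as documented here.

```javascript
// Map the documented error messages to broad categories.
const ERROR_KINDS = [
  { pattern: /cannot be empty/i, kind: "configuration" },
  { pattern: /invalid credentials format/i, kind: "configuration" },
  { pattern: /failed to read document file/i, kind: "file" },
  { pattern: /permission denied/i, kind: "permission" },
  { pattern: /request timeout/i, kind: "transient" },
];

// Return the category of the first matching pattern, or "unknown".
function classifyError(message) {
  const match = ERROR_KINDS.find((entry) => entry.pattern.test(message));
  return match ? match.kind : "unknown";
}

// Only transient errors are treated as worth retrying without changes.
function shouldRetry(message) {
  return classifyError(message) === "transient";
}

console.log(classifyError("Failed to process document: permission denied"));
```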
Use Cases
Document Digitization
Convert paper archives, historical documents, or legacy files into searchable digital text for indexing and retrieval.
Content Extraction
Extract text from contracts, agreements, or legal documents for analysis, comparison, or storage in databases.
Receipt Processing
Convert receipt images to text for expense tracking, accounting systems, or financial analysis.
Form Processing
Extract text from filled forms for data entry automation, validation, or integration with business systems.
Academic Research
Process research papers, books, or manuscripts for text analysis, citation extraction, or digital preservation.
Accessibility
Convert scanned documents or images to text for screen readers and accessibility tools.
Performance Considerations
- Processing Time: Typically 2-10 seconds per page depending on complexity
- Concurrent Requests: Document AI has rate limits; implement queuing for batch processing
- File Size: Larger files take longer; consider splitting very large PDFs
- Network Latency: Choose processor location near your automation deployment
- Cost: Charged per page processed; monitor usage in Google Cloud Console
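The queuing advice for batch processing can be sketched as a small concurrency limiter. `runLimited` is a hypothetical helper, and the right limit depends on the quota of your Google Cloud project, which this document does not specify.

```javascript
// Run async tasks with at most `limit` in flight at once.
// `tasks` is an array of functions that each return a promise.
// Results are returned in the same order as the tasks.
async function runLimited(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  async function worker() {
    while (next < tasks.length) {
      const index = next++; // claim the next unstarted task
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage with stand-in tasks; a real task would trigger document processing.
const demoTasks = ["a.pdf", "b.pdf", "c.pdf"].map(
  (name) => () => new Promise((resolve) => setTimeout(() => resolve(`done ${name}`), 10))
);
runLimited(demoTasks, 2).then((out) => console.log(out));
```

If any task rejects, `runLimited` rejects as well, so wrap individual tasks in their own error handling when one failed document should not stop the batch.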