Extract Key Values
Extracts key-value pairs from forms and structured documents using Google Document AI's form parsing capabilities, automatically identifying field labels and their corresponding values for intelligent data extraction.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- File Path - The local file path of the document to process. Supports PDF, PNG, JPG, JPEG, TIFF, GIF, BMP, and WEBP formats.
- MIME Type - The MIME type of the document (e.g., application/pdf, image/png, image/jpeg). If left empty, the MIME type will be automatically detected from the file content.
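If you prefer to set the MIME Type input explicitly, a small lookup by file extension may be enough. The sketch below is an illustration only: `guessMimeType` and `MIME_BY_EXTENSION` are hypothetical names, and it goes by extension, whereas the node's own auto-detection is described as content-based.

```javascript
const path = require('path');

// Hypothetical lookup table covering the formats listed under File Path.
const MIME_BY_EXTENSION = {
  '.pdf': 'application/pdf',
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.tif': 'image/tiff',
  '.tiff': 'image/tiff',
  '.gif': 'image/gif',
  '.bmp': 'image/bmp',
  '.webp': 'image/webp'
};

// Returns a MIME type for the path, or '' so the node can fall back to
// its own automatic detection.
function guessMimeType(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return MIME_BY_EXTENSION[ext] || '';
}
```

Returning an empty string for unknown extensions keeps the node's automatic detection as the fallback.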
Output
- Pages - An array of page objects, where each object contains:
  - page (number): The page number (starting from 1)
  - key_value_pairs (array): Array of key-value pair objects, where each object contains:
    - key (string): The field label or form field name
    - value (string): The corresponding value or filled-in content
Options
- Credentials - Service Account credentials for Document AI API authentication. Select from Robomotion vault or provide the JSON key file content.
- Project Id - The Google Cloud project ID where Document AI is enabled (e.g., "my-project-123").
- Location - The processor location/region. Default is "us". Available options: "us", "eu", "asia". Choose the region where your processor is deployed.
- Processor Id - The Document AI processor ID to use for form parsing (e.g., "a1b2c3d4e5f6g7h8"). Found in Google Cloud Console under Document AI processors.
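To see how these options relate to each other, it can help to know that Document AI addresses a processor by a resource name built from the project, location, and processor ID, served from a regional endpoint. The sketch below shows that composition; `buildProcessorTarget` is a hypothetical helper, and how the node assembles these values internally is an assumption.

```javascript
// Hypothetical helper: combines the three options into the processor
// resource name and regional endpoint used by Document AI.
function buildProcessorTarget({ projectId, location = 'us', processorId }) {
  if (!projectId) throw new Error('Project ID cannot be empty');
  if (!processorId) throw new Error('Processor ID cannot be empty');
  return {
    endpoint: `${location}-documentai.googleapis.com`,
    name: `projects/${projectId}/locations/${location}/processors/${processorId}`
  };
}

const target = buildProcessorTarget({
  projectId: 'my-project-123',
  location: 'eu',
  processorId: 'a1b2c3d4e5f6g7h8'
});
console.log(target.endpoint);
console.log(target.name);
```

A mismatch between the Location option and the region where the processor was created is a common reason for a processor not being found, so it is worth checking both values together.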
How It Works
The Extract Key Values node integrates with Google Document AI to extract form fields from documents. When executed, the node:
- Validates the provided file path and checks file accessibility
- Detects MIME type automatically if not specified
- Authenticates with Google Document AI using the provided Service Account credentials
- Reads the document file content from the local file system
- Sends the document to the specified Document AI processor
- Processes the document using form parsing machine learning models
- Identifies form fields, labels, and their spatial relationships
- Extracts key-value pairs by matching field labels to their values
- Handles checkboxes, text fields, and various form elements
- Organizes results by page with structured key-value pairs
- Returns extracted data ready for processing or database storage
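The last steps above (matching labels to values and organizing them by page) can be sketched as follows. This is not the node's actual implementation; it assumes a response shaped like Document AI's Document resource, where each form field points into the full document text through text segments. `anchorText`, `toPages`, and `sampleDocument` are hypothetical names.

```javascript
// Resolves a layout's text segments against the full document text.
// In JSON responses the indices may arrive as strings, and a start
// index of 0 may be omitted, so both cases are handled here.
function anchorText(fullText, layout) {
  const segments =
    (layout && layout.textAnchor && layout.textAnchor.textSegments) || [];
  return segments
    .map(seg => fullText.substring(Number(seg.startIndex || 0), Number(seg.endIndex || 0)))
    .join('')
    .trim();
}

// Converts a Document-like object into the Pages structure described
// in the Output section.
function toPages(document) {
  return (document.pages || []).map((page, index) => ({
    page: page.pageNumber || index + 1,
    key_value_pairs: (page.formFields || []).map(field => ({
      key: anchorText(document.text, field.fieldName),
      value: anchorText(document.text, field.fieldValue)
    }))
  }));
}

// Made-up sample data for illustration.
const sampleDocument = {
  text: 'Invoice Number: INV-2024-001\nDate: 01/15/2024\n',
  pages: [{
    pageNumber: 1,
    formFields: [
      {
        fieldName: { textAnchor: { textSegments: [{ endIndex: '15' }] } },
        fieldValue: { textAnchor: { textSegments: [{ startIndex: '16', endIndex: '28' }] } }
      },
      {
        fieldName: { textAnchor: { textSegments: [{ startIndex: '29', endIndex: '34' }] } },
        fieldValue: { textAnchor: { textSegments: [{ startIndex: '35', endIndex: '45' }] } }
      }
    ]
  }]
};

console.log(JSON.stringify(toPages(sampleDocument), null, 2));
```

Note that in this sketch the keys keep their trailing colon, which is one reason the key-matching tips later in this page recommend normalizing labels.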
Requirements
- Google Cloud Setup:
- Active Google Cloud project with billing enabled
- Document AI API enabled in the project
- Form Parser processor created and deployed in Document AI
- Service Account with Document AI User role
- Robomotion Setup:
- Service Account JSON key stored in Robomotion vault
- Document file accessible on the local file system
- File size under the 20 MB limit
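The size requirement can be checked before the node runs, which avoids a wasted API call. A minimal sketch, assuming the limit is interpreted as 20 × 1024 × 1024 bytes (the exact byte threshold is an assumption) and using a hypothetical helper named `isWithinSizeLimit`:

```javascript
const fs = require('fs');

// Assumption: "20 MB" is read as 20 * 1024 * 1024 bytes.
const MAX_BYTES = 20 * 1024 * 1024;

// Returns true when the file is smaller than the limit.
// Throws if the file does not exist or cannot be inspected.
function isWithinSizeLimit(filePath, maxBytes = MAX_BYTES) {
  return fs.statSync(filePath).size < maxBytes;
}
```

In a flow, this check could run in a Function node placed before Extract Key Values, routing oversized files to a separate branch.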
Practical Examples
Example 1: Extract Invoice Fields
// Extract key fields from an invoice
// Outputs: Pages array with key-value pairs in $pages
$pages.forEach(page => {
  console.log(`Processing Page ${page.page}`);
  // Convert array to object for easier access
  const fields = {};
  page.key_value_pairs.forEach(pair => {
    fields[pair.key] = pair.value;
  });
  // Extract common invoice fields
  const invoiceNumber = fields['Invoice Number'] || fields['Invoice #'];
  const invoiceDate = fields['Date'] || fields['Invoice Date'];
  const dueDate = fields['Due Date'] || fields['Payment Due'];
  const total = fields['Total'] || fields['Amount Due'];
  const vendor = fields['Vendor'] || fields['From'];
  console.log('Invoice Details:');
  console.log(` Number: ${invoiceNumber}`);
  console.log(` Date: ${invoiceDate}`);
  console.log(` Due Date: ${dueDate}`);
  console.log(` Total: ${total}`);
  console.log(` Vendor: ${vendor}`);
});
Example 2: Process Application Form
// Extract data from application form
const applicantData = {};
$pages.forEach(page => {
  page.key_value_pairs.forEach(pair => {
    // Normalize key names, e.g. "Full Name:" becomes "full_name"
    const key = pair.key
      .trim()
      .toLowerCase()
      .replace(/[:\s]+/g, '_')
      .replace(/^_+|_+$/g, ''); // drop leading/trailing underscores
    // Guard against missing values before trimming
    applicantData[key] = (pair.value || '').trim();
  });
});
// Access extracted data
console.log('Applicant Information:');
console.log(`Name: ${applicantData.full_name || applicantData.name}`);
console.log(`Email: ${applicantData.email || applicantData.email_address}`);
console.log(`Phone: ${applicantData.phone || applicantData.phone_number}`);
console.log(`Address: ${applicantData.address}`);
console.log(`Date of Birth: ${applicantData.date_of_birth || applicantData.dob}`);
// Validate required fields, accepting the same label variants used above
const requiredFields = {
  name: ['full_name', 'name'],
  email: ['email', 'email_address'],
  phone: ['phone', 'phone_number']
};
const missing = Object.keys(requiredFields).filter(
  field => !requiredFields[field].some(alias => applicantData[alias])
);
if (missing.length > 0) {
  console.warn(`Missing required fields: ${missing.join(', ')}`);
}
Example 3: Save to Database
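// Extract form data and save to database
const db = require('./database'); // Your database module
// await is only valid inside an async function in a plain script,
// so the loop is wrapped in one
async function saveSubmissions(pages) {
  for (const page of pages) {
    const record = {
      page_number: page.page,
      extracted_at: new Date(),
      fields: {}
    };
    // Build fields object
    page.key_value_pairs.forEach(pair => {
      record.fields[pair.key] = pair.value;
    });
    // Save to database
    await db.insert('form_submissions', record);
    console.log(`Saved page ${page.page} to database`);
  }
}
saveSubmissions($pages).catch(err => {
  console.error(`Failed to save form data: ${err.message}`);
});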
Example 4: Compare Form Values
// Compare extracted values against expected values
const expectedValues = {
  'Status': 'Approved',
  'Reviewed By': 'John Smith',
  'Signature': 'Yes'
};
const discrepancies = [];
$pages.forEach(page => {
  page.key_value_pairs.forEach(pair => {
    if (expectedValues[pair.key]) {
      if (pair.value !== expectedValues[pair.key]) {
        discrepancies.push({
          field: pair.key,
          expected: expectedValues[pair.key],
          actual: pair.value
        });
      }
    }
  });
});
if (discrepancies.length > 0) {
  console.warn('Found discrepancies:');
  discrepancies.forEach(d => {
    console.log(` ${d.field}: expected "${d.expected}", got "${d.actual}"`);
  });
} else {
  console.log('All values match expected values');
}
Example 5: Generate Summary Report
// Generate summary of all extracted key-value pairs
let report = 'Extracted Form Data\n';
report += '==================\n\n';
$pages.forEach(page => {
  report += `Page ${page.page}:\n`;
  report += '-'.repeat(50) + '\n';
  if (page.key_value_pairs.length === 0) {
    report += ' No fields found\n';
  } else {
    page.key_value_pairs.forEach(pair => {
      const key = pair.key.padEnd(30);
      report += ` ${key}: ${pair.value}\n`;
    });
  }
  report += '\n';
});
// Save report to file
const fs = require('fs');
fs.writeFileSync('extraction_report.txt', report);
console.log('Report saved to extraction_report.txt');
Example 6: Handle Multi-Page Forms
// Combine data from multi-page form
const combinedData = {};
$pages.forEach(page => {
  console.log(`Processing page ${page.page}/${$pages.length}`);
  page.key_value_pairs.forEach(pair => {
    // Handle duplicate keys across pages.
    // hasOwnProperty also detects keys whose first value was an empty string.
    if (Object.prototype.hasOwnProperty.call(combinedData, pair.key)) {
      // If key exists, convert to array or append
      if (Array.isArray(combinedData[pair.key])) {
        combinedData[pair.key].push(pair.value);
      } else {
        combinedData[pair.key] = [combinedData[pair.key], pair.value];
      }
    } else {
      combinedData[pair.key] = pair.value;
    }
  });
});
console.log('Combined form data:', combinedData);
Tips for Effective Use
Form Design
- Clear Labels: Use clear, distinct labels for form fields
- Consistent Layout: Maintain consistent spacing between labels and values
- Standard Fonts: Use standard, readable fonts (avoid decorative fonts)
- Label Proximity: Keep labels close to their corresponding value fields
Key Matching
- Keys are extracted as-is from the document
- Handle variations in label text (e.g., "Phone" vs "Phone Number")
- Use case-insensitive matching when searching for specific fields
- Trim whitespace from keys and values
- Normalize punctuation in field names
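The matching tips above can be combined into one lookup helper. The sketch below is one possible approach, not part of the node: `normalizeLabel` and `findField` are hypothetical names, and the sample pairs are made up for illustration.

```javascript
// Lowercases, trims, and collapses punctuation/whitespace so that
// " Phone Number: " and "phone number" compare equal.
function normalizeLabel(label) {
  return String(label).trim().toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
}

// Returns the trimmed value of the first pair whose key matches any
// of the given label variants, or undefined when nothing matches.
function findField(pairs, aliases) {
  const wanted = aliases.map(normalizeLabel);
  const match = pairs.find(pair => wanted.includes(normalizeLabel(pair.key)));
  return match ? String(match.value).trim() : undefined;
}

// Sample data shaped like one page's key_value_pairs array.
const samplePairs = [
  { key: ' Phone Number: ', value: ' 555-0100 ' },
  { key: 'E-mail', value: 'jane@example.com' }
];

console.log(findField(samplePairs, ['Phone', 'Phone Number']));
console.log(findField(samplePairs, ['Email', 'E-mail']));
```

Listing the label variants per field keeps the tolerance for wording differences in one place instead of scattering `||` fallbacks through the flow.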
Checkbox Handling
- Checkboxes are detected and values extracted (often "Yes"/"No" or checked/unchecked)
- Handle boolean conversions from text values
- Consider common checkbox representations ("X", "☑", "checked", etc.)
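Converting checkbox text to a boolean might look like the sketch below. The token lists are assumptions based on the representations mentioned above, not an exhaustive list of what the processor returns, and `checkboxToBoolean` is a hypothetical helper. Treating an empty value as unchecked is also an assumption that may not suit every form.

```javascript
// Assumed representations of checked and unchecked boxes.
const CHECKED = new Set(['yes', 'y', 'true', 'x', 'checked', 'selected', 'on', '☑', '☒', '✓', '✔']);
const UNCHECKED = new Set(['no', 'n', 'false', 'unchecked', 'unselected', 'off', '☐', '']);

// Returns true or false for recognized values, and null when the text
// is not a known checkbox representation.
function checkboxToBoolean(value) {
  const text = String(value == null ? '' : value).trim().toLowerCase();
  if (CHECKED.has(text)) return true;
  if (UNCHECKED.has(text)) return false;
  return null;
}
```

Returning null for unrecognized text lets the flow flag the field for review instead of silently guessing.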
Multi-Page Forms
- Process all pages to capture complete form data
- Handle fields that span multiple pages
- Combine or aggregate data appropriately
- Track which page each field came from if needed
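When the page of origin matters, each value can be stored together with its page number. A minimal sketch, with `collectWithPages` as a hypothetical helper and made-up sample data standing in for the node's Pages output:

```javascript
// Groups every occurrence of a key with the page it was found on.
function collectWithPages(pages) {
  const byKey = {};
  for (const page of pages) {
    for (const pair of page.key_value_pairs) {
      if (!Object.prototype.hasOwnProperty.call(byKey, pair.key)) {
        byKey[pair.key] = [];
      }
      byKey[pair.key].push({ page: page.page, value: pair.value });
    }
  }
  return byKey;
}

// Sample data shaped like the node's Pages output.
const samplePages = [
  { page: 1, key_value_pairs: [{ key: 'Name', value: 'Jane Doe' }] },
  { page: 2, key_value_pairs: [{ key: 'Name', value: 'Jane Doe' }, { key: 'Signature', value: 'Yes' }] }
];

console.log(JSON.stringify(collectWithPages(samplePages), null, 2));
```

Keeping every occurrence, rather than overwriting, makes it possible to detect conflicting values for the same label on different pages.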
Data Validation
- Required Fields: Check for presence of required fields
- Format Validation: Validate dates, emails, phone numbers, etc.
- Value Ranges: Verify numeric values are within expected ranges
- Cross-Field Validation: Ensure related fields are consistent
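The four kinds of checks above might be combined as in the following sketch. It assumes invoice-style field names, dates in MM/DD/YYYY form, and an arbitrary upper bound for totals; all of these, along with the helper names, are illustrative assumptions to adapt to your own forms.

```javascript
// Parses MM/DD/YYYY and rejects impossible dates such as 13/40/2024.
function parseUsDate(value) {
  const m = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(String(value).trim());
  if (!m) return null;
  const [month, day, year] = [Number(m[1]), Number(m[2]), Number(m[3])];
  const date = new Date(year, month - 1, day);
  const valid = date.getFullYear() === year && date.getMonth() === month - 1 && date.getDate() === day;
  return valid ? date : null;
}

// Parses amounts such as "$1,250.00"; returns null when not numeric.
function parseAmount(value) {
  const cleaned = String(value).replace(/[$,\s]/g, '');
  if (cleaned === '') return null;
  const n = Number(cleaned);
  return Number.isFinite(n) ? n : null;
}

function validateInvoice(fields) {
  const errors = [];
  // Required fields
  for (const name of ['Invoice Number', 'Date', 'Total']) {
    if (!fields[name]) errors.push(`${name} is required`);
  }
  // Format validation
  const date = fields['Date'] ? parseUsDate(fields['Date']) : null;
  if (fields['Date'] && !date) errors.push('Date is not a valid MM/DD/YYYY date');
  // Value range (the upper bound is an arbitrary example)
  if (fields['Total']) {
    const total = parseAmount(fields['Total']);
    if (total === null || total <= 0 || total > 1000000) {
      errors.push('Total is not a number within range');
    }
  }
  // Cross-field validation
  const due = fields['Due Date'] ? parseUsDate(fields['Due Date']) : null;
  if (date && due && due < date) errors.push('Due Date is before Date');
  return errors;
}
```

An empty array means every check passed; otherwise the messages can be logged or routed to a manual review step.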
Common Errors and Solutions
Error: "File path cannot be empty"
Cause: No file path provided to the node.
Solution: Ensure the File Path input is populated with a valid path.
Error: "Failed to read document file"
Cause: File not found, permission denied, or path is incorrect.
Solution:
- Verify the file exists at the specified path
- Check file permissions allow reading
- Use absolute paths instead of relative paths
- Ensure the file hasn't been moved or deleted
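These checks can be run ahead of the node so that path problems surface with a specific reason. The sketch below uses Node.js `fs.accessSync`; `checkReadable` is a hypothetical helper, and the reason codes it returns (such as ENOENT or EACCES) come from the operating system rather than from the node.

```javascript
const fs = require('fs');
const path = require('path');

// Resolves the path to an absolute one and verifies that it points to
// a readable regular file. Returns { ok, path, reason }.
function checkReadable(filePath) {
  if (!filePath) return { ok: false, reason: 'File path cannot be empty' };
  const absolute = path.resolve(filePath);
  try {
    fs.accessSync(absolute, fs.constants.R_OK);
  } catch (err) {
    // err.code is e.g. 'ENOENT' (not found) or 'EACCES' (permission denied)
    return { ok: false, path: absolute, reason: err.code };
  }
  if (!fs.statSync(absolute).isFile()) {
    return { ok: false, path: absolute, reason: 'Not a regular file' };
  }
  return { ok: true, path: absolute };
}
```

Passing the resolved absolute path on to the File Path input also avoids surprises when the flow's working directory differs from what was expected.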
Error: "Project ID cannot be empty"
Cause: Project ID option is not configured.
Solution: Add your Google Cloud project ID in the node options.
Error: "Processor ID cannot be empty"
Cause: Processor ID option is not configured.
Solution:
- Create a Form Parser processor in Google Cloud Console
- Copy the processor ID from the processor details page
- Add it to the node options
Error: "Invalid credentials format: missing content field"
Cause: Credentials are not properly formatted or stored.
Solution:
- Re-download Service Account JSON key from Google Cloud Console
- Save the complete JSON content in Robomotion vault
- Select the correct credential from the dropdown
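Before saving the key to the vault, it may help to confirm that the text is a complete Service Account key. The sketch below checks a few fields that Google Service Account JSON keys contain; `checkServiceAccountKey` is a hypothetical helper, and it does not reproduce the node's own validation, which refers to the vault item's content field.

```javascript
// Returns a list of problems found in the key text; an empty list
// means the basic structure looks like a Service Account key.
function checkServiceAccountKey(jsonText) {
  let key;
  try {
    key = JSON.parse(jsonText);
  } catch (err) {
    return ['Content is not valid JSON'];
  }
  if (key === null || typeof key !== 'object' || Array.isArray(key)) {
    return ['Content is not a JSON object'];
  }
  const problems = [];
  if (key.type !== 'service_account') problems.push('type must be "service_account"');
  for (const field of ['project_id', 'private_key', 'client_email']) {
    if (!key[field]) problems.push(`missing ${field}`);
  }
  return problems;
}
```

A truncated paste usually fails the JSON parse step, which makes this a quick way to catch incomplete copies.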
No key-value pairs found
Cause: Document doesn't contain detectable form fields or layout isn't recognized.
Solution:
- Verify document contains actual form fields (labels with associated values)
- Ensure labels and values are properly aligned and spaced
- Check document quality (not blurry or skewed)
- Try using a specialized processor for the document type
- Consider using Extract Text if structure is too complex
Incorrect key-value pairing
Cause: Spatial relationship between label and value not detected correctly.
Solution:
- Ensure consistent spacing between labels and values
- Avoid complex layouts with multiple columns
- Use clear visual separation between fields
- Check for proper alignment of labels and values
- Manually map fields if automated extraction is inconsistent
Missing values for some fields
Cause: Empty fields or low-contrast text not detected.
Solution:
- Verify all fields are filled in the original document
- Check for sufficient contrast between text and background
- Ensure handwritten values are clear and legible
- Increase scan resolution for better detection
- Use default values for optional empty fields
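Applying defaults to optional fields could look like the sketch below. `applyDefaults` and the field names in the example are hypothetical; the helper treats undefined, null, and whitespace-only values as empty.

```javascript
// Returns a copy of fields in which empty or missing optional fields
// are replaced by the given defaults. The input object is not modified.
function applyDefaults(fields, defaults) {
  const result = { ...fields };
  for (const [key, fallback] of Object.entries(defaults)) {
    const value = result[key];
    if (value === undefined || value === null || String(value).trim() === '') {
      result[key] = fallback;
    }
  }
  return result;
}

const extracted = { 'Invoice Number': 'INV-2024-001', 'Payment Terms': '  ' };
const completed = applyDefaults(extracted, { 'Payment Terms': 'Net 30', 'Currency': 'USD' });
console.log(completed);
```

Defaults are best limited to genuinely optional fields, so that a missing required field still shows up in validation.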
Use Cases
Invoice Processing
Extract vendor information, invoice numbers, dates, amounts, and payment terms from invoices for automated accounts payable workflows.
Form Digitization
Convert paper forms (applications, registrations, surveys) to structured digital data for database entry and processing.
Insurance Claims
Extract policy numbers, claim dates, claimant information, and claim details from insurance claim forms for automated processing.
Medical Records
Extract patient information, medical history, diagnoses, and prescriptions from medical forms and health records.
Tax Documents
Extract taxpayer information, income details, deductions, and other data from tax forms (W-2, 1099, etc.).
Loan Applications
Extract applicant details, employment information, financial data, and supporting documentation from loan application forms.
Compliance Documents
Extract required information from regulatory forms, certifications, and compliance documentation for auditing and reporting.
Contract Data
Extract key terms, parties, dates, amounts, and conditions from contracts and agreements for contract management systems.
Output Structure Example
// Example output structure
{
  pages: [
    {
      page: 1,
      key_value_pairs: [
        {
          key: 'Invoice Number',
          value: 'INV-2024-001'
        },
        {
          key: 'Date',
          value: '01/15/2024'
        },
        {
          key: 'Total Amount',
          value: '$1,250.00'
        },
        {
          key: 'Payment Terms',
          value: 'Net 30'
        },
        {
          key: 'Vendor Name',
          value: 'Acme Corporation'
        }
      ]
    }
  ]
}
Field Name Normalization
Consider normalizing field names for consistency:
// Normalize field names, e.g. "Invoice Number:" becomes "invoice_number"
function normalizeFieldName(name) {
  return name
    .toLowerCase()
    .trim()
    .replace(/[:\s]+/g, '_')
    .replace(/[^a-z0-9_]/g, '')
    .replace(/^_+|_+$/g, ''); // drop underscores left by leading/trailing colons
}
// Use normalized names
const normalizedData = {};
$pages.forEach(page => {
  page.key_value_pairs.forEach(pair => {
    const normalizedKey = normalizeFieldName(pair.key);
    normalizedData[normalizedKey] = pair.value;
  });
});
Performance Considerations
- Processing Time: Typically 2-10 seconds per page depending on form complexity
- Field Count: Documents with many fields take slightly longer to process
- Layout Complexity: Complex multi-column layouts may impact accuracy
- Concurrent Requests: Implement queuing for batch processing to respect rate limits
- Network Latency: Choose processor location near your automation deployment
- Cost: Charged per page processed; monitor usage in Google Cloud Console
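For batch processing, one simple form of queuing is to process documents one at a time with a pause between requests. The sketch below illustrates that idea; `processSequentially` is a hypothetical helper, the `worker` callback stands in for whatever triggers the extraction in your flow, and the one-second default delay is an arbitrary assumption rather than a documented rate limit.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Runs worker(item) for each item in order, waiting delayMs between
// calls. Failures are recorded per item so one bad document does not
// stop the rest of the batch.
async function processSequentially(items, worker, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await sleep(delayMs);
    try {
      results.push({ item: items[i], ok: true, value: await worker(items[i]) });
    } catch (err) {
      results.push({ item: items[i], ok: false, error: err.message });
    }
  }
  return results;
}
```

Sequential processing trades throughput for predictability; if higher volume is needed, the delay can be reduced gradually while monitoring quota usage in Google Cloud Console.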