Skip to main content

Extract Tables

Extracts table data from documents using Google Document AI with intelligent structure detection, preserving column headers and row values for automated data processing and analysis.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
info

If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • File Path - The local file path of the document to process. Supports PDF, PNG, JPG, JPEG, TIFF, GIF, BMP, and WEBP formats.
  • MIME Type - The MIME type of the document (e.g., application/pdf, image/png, image/jpeg). If left empty, the MIME type will be automatically detected from the file content.

Output

  • Pages - An array of page objects, where each object contains:
    • page (number): The page number (starting from 1)
    • tables (array): Array of table objects found on that page, where each table contains:
      • columns (array): Array of column header names
      • rows (array): Array of row objects, where each row is a key-value map using column names as keys

Options

  • Credentials - Service Account credentials for Document AI API authentication. Select from Robomotion vault or provide the JSON key file content.
  • Project Id - The Google Cloud project ID where Document AI is enabled (e.g., "my-project-123").
  • Location - The processor location/region. Default is "us". Available options: "us", "eu", "asia". Choose the region where your processor is deployed.
  • Processor Id - The Document AI processor ID to use for table extraction (e.g., "a1b2c3d4e5f6g7h8"). Found in Google Cloud Console under Document AI processors.

How It Works

The Extract Tables node integrates with Google Document AI to extract tables from documents. When executed, the node:

  1. Validates the provided file path and checks file accessibility
  2. Detects MIME type automatically if not specified
  3. Authenticates with Google Document AI using the provided Service Account credentials
  4. Reads the document file content from the local file system
  5. Sends the document to the specified Document AI processor
  6. Processes the document using table detection machine learning models
  7. Identifies table boundaries, header rows, and body rows
  8. Extracts column headers and maps them to cell values
  9. Organizes results by page with structured table data
  10. Returns tables with columns array and rows array for easy processing

Requirements

  • Google Cloud Setup:
    • Active Google Cloud project with billing enabled
    • Document AI API enabled in the project
    • OCR or Form Parser processor created and deployed in Document AI
    • Service Account with Document AI User role
  • Robomotion Setup:
    • Service Account JSON key stored in Robomotion vault
    • Document file accessible on the local file system
    • File size under 20MB limit

Practical Examples

Example 1: Extract Invoice Line Items

// Extract tables from an invoice and process line items
// Outputs: Pages array with tables in $pages

$pages.forEach(page => {
console.log(`Processing Page ${page.page}`);

page.tables.forEach((table, tableIndex) => {
console.log(`Table ${tableIndex + 1}:`, table.columns);

table.rows.forEach((row, rowIndex) => {
console.log(`Row ${rowIndex + 1}:`);

// Access values by column name
const description = row['Description'] || row['Item'];
const quantity = row['Quantity'] || row['Qty'];
const price = row['Price'] || row['Unit Price'];
const total = row['Total'] || row['Amount'];

console.log(` ${description}: ${quantity} x $${price} = $${total}`);
});
});
});

Example 2: Convert Tables to Excel

// Extract tables and export to Excel format
const XLSX = require('xlsx');

const workbook = XLSX.utils.book_new();

$pages.forEach(page => {
page.tables.forEach((table, tableIndex) => {
// Convert table to array format for Excel
const data = [];

// Add header row
data.push(table.columns);

// Add data rows
table.rows.forEach(row => {
const rowArray = table.columns.map(col => row[col] || '');
data.push(rowArray);
});

// Create worksheet
const worksheet = XLSX.utils.aoa_to_sheet(data);
const sheetName = `Page${page.page}_Table${tableIndex + 1}`;

XLSX.utils.book_append_sheet(workbook, worksheet, sheetName);
});
});

// Write to file
XLSX.writeFile(workbook, 'extracted_tables.xlsx');
console.log('Tables exported to extracted_tables.xlsx');

Example 3: Aggregate Data from Multiple Tables

// Sum values across all tables in document
let totalAmount = 0;
let itemCount = 0;

$pages.forEach(page => {
page.tables.forEach(table => {
table.rows.forEach(row => {
// Extract numeric values from Amount or Total column
const amountStr = row['Amount'] || row['Total'] || '0';
const amount = parseFloat(amountStr.replace(/[$,]/g, ''));

if (!isNaN(amount)) {
totalAmount += amount;
itemCount++;
}
});
});
});

console.log(`Total items: ${itemCount}`);
console.log(`Total amount: $${totalAmount.toFixed(2)}`);

Example 4: Filter and Search Table Data

// Search for specific items in tables
const searchTerm = "Widget";
const matchingRows = [];

$pages.forEach(page => {
page.tables.forEach(table => {
table.rows.forEach(row => {
// Search across all columns
const rowText = Object.values(row).join(' ').toLowerCase();

if (rowText.includes(searchTerm.toLowerCase())) {
matchingRows.push({
page: page.page,
data: row
});
}
});
});
});

console.log(`Found ${matchingRows.length} matching rows:`);
matchingRows.forEach(match => {
console.log(`Page ${match.page}:`, match.data);
});

Example 5: Validate Table Structure

// Validate that tables have expected columns
const expectedColumns = ['Item', 'Quantity', 'Price', 'Total'];
const issues = [];

$pages.forEach(page => {
page.tables.forEach((table, tableIndex) => {
expectedColumns.forEach(expectedCol => {
const hasColumn = table.columns.some(col =>
col.toLowerCase().includes(expectedCol.toLowerCase())
);

if (!hasColumn) {
issues.push(`Page ${page.page}, Table ${tableIndex + 1}: Missing column '${expectedCol}'`);
}
});
});
});

if (issues.length > 0) {
console.warn('Validation issues:', issues);
} else {
console.log('All tables have expected columns');
}

Tips for Effective Use

Table Detection

  1. Clear Borders: Tables with visible borders are detected more accurately
  2. Consistent Formatting: Use consistent spacing and alignment within tables
  3. Header Row: Ensure the first row clearly indicates column headers
  4. Avoid Merged Cells: Complex merged cells may impact detection accuracy

Column Mapping

  • Column headers are used as keys for accessing row data
  • Handle variations in column names (e.g., "Qty" vs "Quantity")
  • Use case-insensitive matching when searching for columns
  • Check for null or empty values in cells

Multi-Table Documents

  • Process all tables in order they appear on the page
  • Track table index to identify specific tables
  • Consider table position when multiple tables have similar structure
  • Handle pages with no tables gracefully

Data Type Handling

  • Table cells are extracted as strings
  • Convert numeric values using parseFloat() or parseInt()
  • Remove currency symbols and formatting before conversion
  • Handle date formats consistently

Performance Optimization

  1. Process Only What's Needed: Filter pages with tables if possible
  2. Batch Processing: Group similar documents for processing
  3. Cache Results: Store extracted tables to avoid reprocessing
  4. Validate Early: Check for tables before full processing

Common Errors and Solutions

Error: "File path cannot be empty"

Cause: No file path provided to the node. Solution: Ensure the File Path input is populated with a valid path.

Error: "Failed to read document file"

Cause: File not found, permission denied, or path is incorrect. Solution:

  • Verify the file exists at the specified path
  • Check file permissions allow reading
  • Use absolute paths instead of relative paths
  • Ensure the file hasn't been moved or deleted

Error: "Project ID cannot be empty"

Cause: Project ID option is not configured. Solution: Add your Google Cloud project ID in the node options.

Error: "Processor ID cannot be empty"

Cause: Processor ID option is not configured. Solution:

  • Create a processor in Google Cloud Console
  • Copy the processor ID from the processor details page
  • Add it to the node options

Error: "Invalid credentials format: missing content field"

Cause: Credentials are not properly formatted or stored. Solution:

  • Re-download Service Account JSON key from Google Cloud Console
  • Save the complete JSON content in Robomotion vault
  • Select the correct credential from the dropdown

No tables found in output

Cause: Document doesn't contain detectable tables or tables are not recognized. Solution:

  • Verify document contains actual tables (not just text arranged in columns)
  • Use tables with clear borders and structure
  • Check if document quality is sufficient (not blurry or skewed)
  • Try using a specialized processor for better table detection

Incorrect column mapping

Cause: Headers not properly detected or merged cells causing confusion. Solution:

  • Ensure first row clearly indicates headers
  • Avoid complex header structures with merged cells
  • Manually map columns if headers are inconsistent
  • Use column index instead of name for problematic tables

Missing cell values

Cause: Empty cells or detection issues in sparse tables. Solution:

  • Check for null or undefined values before accessing
  • Use default values for missing cells
  • Verify document doesn't have faint or light-colored text
  • Increase scan resolution for better detection

Use Cases

Invoice Processing

Extract line items, quantities, prices, and totals from vendor invoices for automated accounts payable processing.

Financial Reports

Extract tables from financial statements, balance sheets, and P&L reports for analysis and comparison.

Inventory Management

Process inventory lists, stock reports, and warehouse documents to update inventory systems.

Price List Updates

Extract product catalogs and price lists to update e-commerce systems or pricing databases.

Timesheet Processing

Extract employee timesheets with hours worked, tasks, and billing information for payroll processing.

Expense Reports

Process expense report tables with dates, categories, amounts, and descriptions for reimbursement workflows.

Data Migration

Extract tabular data from legacy documents during system migrations or data consolidation projects.

Compliance Reporting

Extract data tables from regulatory filings and compliance documents for auditing and reporting.

Output Structure Example

// Example output structure
{
pages: [
{
page: 1,
tables: [
{
columns: ['Item', 'Quantity', 'Price', 'Total'],
rows: [
{
'Item': 'Widget A',
'Quantity': '5',
'Price': '$10.00',
'Total': '$50.00'
},
{
'Item': 'Widget B',
'Quantity': '3',
'Price': '$15.00',
'Total': '$45.00'
}
]
}
]
}
]
}

Performance Considerations

  • Processing Time: Typically 2-15 seconds per page depending on table complexity
  • Table Complexity: Documents with many tables take longer to process
  • Cell Count: Tables with hundreds of cells may impact processing speed
  • Concurrent Requests: Implement queuing for batch processing to respect rate limits
  • Network Latency: Choose processor location near your automation deployment
  • Cost: Charged per page processed; monitor usage in Google Cloud Console