Extract Tables

Extracts table data from documents using Google Document AI with intelligent structure detection, preserving column headers and row values for automated data processing and analysis.

Common Properties

Name - The custom name of the node.
Color - The custom color of the node.
Delay Before (sec) - Waits in seconds before executing the node.
Delay After (sec) - Waits in seconds after executing node.
Continue On Error - Automation will continue regardless of any error. The default value is false.

info

If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

File Path - The local file path of the document to process. Supports PDF, PNG, JPG, JPEG, TIFF, GIF, BMP, and WEBP formats.
MIME Type - The MIME type of the document (e.g., application/pdf, image/png, image/jpeg). If left empty, the MIME type will be automatically detected from the file content.

Output

Pages - An array of page objects, where each object contains:
- page (number): The page number (starting from 1)
- tables (array): Array of table objects found on that page, where each table contains:
  - columns (array): Array of column header names
  - rows (array): Array of row objects, where each row is a key-value map using column names as keys

Options

Credentials - Service Account credentials for Document AI API authentication. Select from Robomotion vault or provide the JSON key file content.
Project Id - The Google Cloud project ID where Document AI is enabled (e.g., "my-project-123").
Location - The processor location/region. Default is "us". Available options: "us", "eu", "asia". Choose the region where your processor is deployed.
Processor Id - The Document AI processor ID to use for table extraction (e.g., "a1b2c3d4e5f6g7h8"). Found in Google Cloud Console under Document AI processors.

How It Works

The Extract Tables node integrates with Google Document AI to extract tables from documents. When executed, the node:

Validates the provided file path and checks file accessibility
Detects MIME type automatically if not specified
Authenticates with Google Document AI using the provided Service Account credentials
Reads the document file content from the local file system
Sends the document to the specified Document AI processor
Processes the document using table detection machine learning models
Identifies table boundaries, header rows, and body rows
Extracts column headers and maps them to cell values
Organizes results by page with structured table data
Returns tables with columns array and rows array for easy processing

Requirements

Google Cloud Setup:
- Active Google Cloud project with billing enabled
- Document AI API enabled in the project
- OCR or Form Parser processor created and deployed in Document AI
- Service Account with Document AI User role
Robomotion Setup:
- Service Account JSON key stored in Robomotion vault
- Document file accessible on the local file system
- File size under 20MB limit

Practical Examples

Example 1: Extract Invoice Line Items

// Extract tables from an invoice and process line items
// Outputs: Pages array with tables in $pages

$pages.forEach(page => {
  console.log(`Processing Page ${page.page}`);

  page.tables.forEach((table, tableIndex) => {
    console.log(`Table ${tableIndex + 1}:`, table.columns);

    table.rows.forEach((row, rowIndex) => {
      console.log(`Row ${rowIndex + 1}:`);

      // Access values by column name
      const description = row['Description'] || row['Item'];
      const quantity = row['Quantity'] || row['Qty'];
      const price = row['Price'] || row['Unit Price'];
      const total = row['Total'] || row['Amount'];

      console.log(`  ${description}: ${quantity} x $${price} = $${total}`);
    });
  });
});

Example 2: Convert Tables to Excel

// Extract tables and export to Excel format
const XLSX = require('xlsx');

const workbook = XLSX.utils.book_new();

$pages.forEach(page => {
  page.tables.forEach((table, tableIndex) => {
    // Convert table to array format for Excel
    const data = [];

    // Add header row
    data.push(table.columns);

    // Add data rows
    table.rows.forEach(row => {
      const rowArray = table.columns.map(col => row[col] || '');
      data.push(rowArray);
    });

    // Create worksheet
    const worksheet = XLSX.utils.aoa_to_sheet(data);
    const sheetName = `Page${page.page}_Table${tableIndex + 1}`;

    XLSX.utils.book_append_sheet(workbook, worksheet, sheetName);
  });
});

// Write to file
XLSX.writeFile(workbook, 'extracted_tables.xlsx');
console.log('Tables exported to extracted_tables.xlsx');

Example 3: Aggregate Data from Multiple Tables

// Sum values across all tables in document
let totalAmount = 0;
let itemCount = 0;

$pages.forEach(page => {
  page.tables.forEach(table => {
    table.rows.forEach(row => {
      // Extract numeric values from Amount or Total column
      const amountStr = row['Amount'] || row['Total'] || '0';
      const amount = parseFloat(amountStr.replace(/[$,]/g, ''));

      if (!isNaN(amount)) {
        totalAmount += amount;
        itemCount++;
      }
    });
  });
});

console.log(`Total items: ${itemCount}`);
console.log(`Total amount: $${totalAmount.toFixed(2)}`);

Example 4: Filter and Search Table Data

// Search for specific items in tables
const searchTerm = "Widget";
const matchingRows = [];

$pages.forEach(page => {
  page.tables.forEach(table => {
    table.rows.forEach(row => {
      // Search across all columns
      const rowText = Object.values(row).join(' ').toLowerCase();

      if (rowText.includes(searchTerm.toLowerCase())) {
        matchingRows.push({
          page: page.page,
          data: row
        });
      }
    });
  });
});

console.log(`Found ${matchingRows.length} matching rows:`);
matchingRows.forEach(match => {
  console.log(`Page ${match.page}:`, match.data);
});

Example 5: Validate Table Structure

// Validate that tables have expected columns
const expectedColumns = ['Item', 'Quantity', 'Price', 'Total'];
const issues = [];

$pages.forEach(page => {
  page.tables.forEach((table, tableIndex) => {
    expectedColumns.forEach(expectedCol => {
      const hasColumn = table.columns.some(col =>
        col.toLowerCase().includes(expectedCol.toLowerCase())
      );

      if (!hasColumn) {
        issues.push(`Page ${page.page}, Table ${tableIndex + 1}: Missing column '${expectedCol}'`);
      }
    });
  });
});

if (issues.length > 0) {
  console.warn('Validation issues:', issues);
} else {
  console.log('All tables have expected columns');
}

Tips for Effective Use

Table Detection

Clear Borders: Tables with visible borders are detected more accurately
Consistent Formatting: Use consistent spacing and alignment within tables
Header Row: Ensure the first row clearly indicates column headers
Avoid Merged Cells: Complex merged cells may impact detection accuracy

Column Mapping

Column headers are used as keys for accessing row data
Handle variations in column names (e.g., "Qty" vs "Quantity")
Use case-insensitive matching when searching for columns
Check for null or empty values in cells

Multi-Table Documents

Process all tables in order they appear on the page
Track table index to identify specific tables
Consider table position when multiple tables have similar structure
Handle pages with no tables gracefully

Data Type Handling

Table cells are extracted as strings
Convert numeric values using parseFloat() or parseInt()
Remove currency symbols and formatting before conversion
Handle date formats consistently

Performance Optimization

Process Only What's Needed: Filter pages with tables if possible
Batch Processing: Group similar documents for processing
Cache Results: Store extracted tables to avoid reprocessing
Validate Early: Check for tables before full processing

Common Errors and Solutions

Error: "File path cannot be empty"

Cause: No file path provided to the node. Solution: Ensure the File Path input is populated with a valid path.

Error: "Failed to read document file"

Cause: File not found, permission denied, or path is incorrect. Solution:

Verify the file exists at the specified path
Check file permissions allow reading
Use absolute paths instead of relative paths
Ensure the file hasn't been moved or deleted

Error: "Project ID cannot be empty"

Cause: Project ID option is not configured. Solution: Add your Google Cloud project ID in the node options.

Error: "Processor ID cannot be empty"

Cause: Processor ID option is not configured. Solution:

Create a processor in Google Cloud Console
Copy the processor ID from the processor details page
Add it to the node options

Error: "Invalid credentials format: missing content field"

Cause: Credentials are not properly formatted or stored. Solution:

Re-download Service Account JSON key from Google Cloud Console
Save the complete JSON content in Robomotion vault
Select the correct credential from the dropdown

No tables found in output

Cause: Document doesn't contain detectable tables or tables are not recognized. Solution:

Verify document contains actual tables (not just text arranged in columns)
Use tables with clear borders and structure
Check if document quality is sufficient (not blurry or skewed)
Try using a specialized processor for better table detection

Incorrect column mapping

Cause: Headers not properly detected or merged cells causing confusion. Solution:

Ensure first row clearly indicates headers
Avoid complex header structures with merged cells
Manually map columns if headers are inconsistent
Use column index instead of name for problematic tables

Missing cell values

Cause: Empty cells or detection issues in sparse tables. Solution:

Check for null or undefined values before accessing
Use default values for missing cells
Verify document doesn't have faint or light-colored text
Increase scan resolution for better detection

Use Cases

Invoice Processing

Extract line items, quantities, prices, and totals from vendor invoices for automated accounts payable processing.

Financial Reports

Extract tables from financial statements, balance sheets, and P&L reports for analysis and comparison.

Inventory Management

Process inventory lists, stock reports, and warehouse documents to update inventory systems.

Price List Updates

Extract product catalogs and price lists to update e-commerce systems or pricing databases.

Timesheet Processing

Extract employee timesheets with hours worked, tasks, and billing information for payroll processing.

Expense Reports

Process expense report tables with dates, categories, amounts, and descriptions for reimbursement workflows.

Data Migration

Extract tabular data from legacy documents during system migrations or data consolidation projects.

Compliance Reporting

Extract data tables from regulatory filings and compliance documents for auditing and reporting.

Output Structure Example

// Example output structure
{
  pages: [
    {
      page: 1,
      tables: [
        {
          columns: ['Item', 'Quantity', 'Price', 'Total'],
          rows: [
            {
              'Item': 'Widget A',
              'Quantity': '5',
              'Price': '$10.00',
              'Total': '$50.00'
            },
            {
              'Item': 'Widget B',
              'Quantity': '3',
              'Price': '$15.00',
              'Total': '$45.00'
            }
          ]
        }
      ]
    }
  ]
}

Performance Considerations

Processing Time: Typically 2-15 seconds per page depending on table complexity
Table Complexity: Documents with many tables take longer to process
Cell Count: Tables with hundreds of cells may impact processing speed
Concurrent Requests: Implement queuing for batch processing to respect rate limits
Network Latency: Choose processor location near your automation deployment
Cost: Charged per page processed; monitor usage in Google Cloud Console

Common Properties​

Inputs​

Output​

Options​

How It Works​

Requirements​

Practical Examples​

Example 1: Extract Invoice Line Items​

Example 2: Convert Tables to Excel​

Example 3: Aggregate Data from Multiple Tables​

Example 4: Filter and Search Table Data​

Example 5: Validate Table Structure​

Tips for Effective Use​

Table Detection​

Column Mapping​

Multi-Table Documents​

Data Type Handling​

Performance Optimization​

Common Errors and Solutions​

Error: "File path cannot be empty"​

Error: "Failed to read document file"​

Error: "Project ID cannot be empty"​

Error: "Processor ID cannot be empty"​

Error: "Invalid credentials format: missing content field"​

No tables found in output​

Incorrect column mapping​

Missing cell values​

Use Cases​

Invoice Processing​

Financial Reports​

Inventory Management​

Price List Updates​

Timesheet Processing​

Expense Reports​

Data Migration​

Compliance Reporting​

Output Structure Example​

Performance Considerations​