
Extract Tables

Extract tables from documents (PDF, DOCX, HTML, etc.) with support for multiple output formats. Get structured table data ready for processing, analysis, or database insertion.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • File Path - Path to the document file. Works best with PDF, DOCX, HTML, and XLSX files.

Options

  • Output Format - Format for the extracted tables. Default is "HTML (Structured)".
    • HTML (Structured) - Preserves visual structure with HTML table tags. Best for rendering or maintaining formatting.
    • Text (Simple) - Plain text representation. Good for basic display or text analysis.
    • JSON (Data) - Structured data with columns and rows. Best for programmatic processing and database insertion.

Output

  • Tables - Array of extracted tables. Each table includes:
    • index - Position of the table in the document (0-based)
    • page - Page number where the table appears (if available)
    • content - Table content in the selected format (HTML or Text)
    • table - (JSON format only) Structured table object with columns and rows
    • html - (when available) Original HTML representation
  • Table Count - Number of tables found in the document.

JSON Format Structure

When using JSON output format, each table follows this structure:

{
  "index": 0,
  "page": 2,
  "table": {
    "columns": ["Name", "Age", "Location"],
    "rows": [
      { "Name": "John", "Age": 54, "Location": "Boston" },
      { "Name": "Jane", "Age": 47, "Location": "London" }
    ]
  }
}

Notes on JSON format:

  • Numeric values are automatically converted to numbers
  • Column names come from the first row of the table
  • Each row is an object with column names as keys

How It Works

The Extract Tables node scans documents for table elements and extracts them in your preferred format. When executed, the node:

  1. Validates the provided file path
  2. Parses the entire document
  3. Identifies all table elements
  4. Extracts table content with metadata (page numbers, position)
  5. Converts tables to the selected output format
  6. For JSON format, parses HTML tables into structured columns and rows
  7. Returns the array of tables along with the table count
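To give a feel for step 6, the sketch below turns a simple HTML table string into a two-dimensional array of cell texts, from which columns and rows can then be derived. It is an illustration only, not the node's parser: `htmlTableToGrid` is a hypothetical name, the regular expressions assume flat tables without nesting or merged cells, and HTML entities are not decoded.

```javascript
// Hypothetical helper: split a simple HTML table into rows of cell text.
// Assumes no nested tables and no rowspan/colspan handling.
function htmlTableToGrid(html) {
  const rows = html.match(/<tr\b[\s\S]*?<\/tr>/gi) || [];
  return rows.map(row => {
    const cells = row.match(/<t[hd]\b[^>]*>[\s\S]*?<\/t[hd]>/gi) || [];
    // Strip any remaining tags inside the cell and trim whitespace.
    return cells.map(cell => cell.replace(/<[^>]+>/g, '').trim());
  });
}

const grid = htmlTableToGrid(
  '<table><tr><th>Name</th><th>Value</th></tr>' +
  '<tr><td>Item 1</td><td>100</td></tr></table>'
);
```

For real documents a proper HTML parser is a safer choice than regular expressions; the sketch is only meant to show the shape of the transformation.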

Requirements

  • Valid file path to a document containing tables
  • Read access to the file
  • Document with properly structured tables (best results with native tables, not text formatted to look like tables)

Error Handling

The node will return specific errors in the following cases:

  • Empty or missing file path
  • File not found at the specified path
  • Unsupported document format
  • Corrupted or invalid document file
  • Insufficient permissions to read the file
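Several of these cases can be checked before the node runs. The following is a minimal sketch, assuming a Node.js environment where `fs` and `path` can be required. `checkFilePath` is a hypothetical helper, the returned strings are illustrative rather than the node's actual error messages, and the extension list is an assumption based on the formats named on this page.

```javascript
const fs = require('fs');
const path = require('path');

// Assumption: formats this page says work best; the node may accept more.
const LIKELY_SUPPORTED = new Set(['.pdf', '.docx', '.html', '.xlsx']);

// Hypothetical pre-flight check: returns a problem description, or null if none found.
function checkFilePath(filePath) {
  if (typeof filePath !== 'string' || filePath.trim() === '') {
    return 'File path is empty or missing';
  }
  if (!fs.existsSync(filePath)) {
    return 'File not found at the specified path';
  }
  try {
    fs.accessSync(filePath, fs.constants.R_OK); // throws if the file is not readable
  } catch {
    return 'Insufficient permissions to read the file';
  }
  if (!LIKELY_SUPPORTED.has(path.extname(filePath).toLowerCase())) {
    return 'Possibly unsupported document format';
  }
  return null;
}
```

A corrupted or invalid document cannot be detected this way; that case only surfaces when the document is actually parsed.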

Usage Examples

Example 1: Extract Tables as JSON for Processing

// Input
msg.filePath = "/reports/quarterly-results.pdf"

// Extract Tables (JSON format) output
msg.tables = [
  {
    "index": 0,
    "page": 3,
    "table": {
      "columns": ["Quarter", "Revenue", "Profit"],
      "rows": [
        { "Quarter": "Q1", "Revenue": 1000000, "Profit": 150000 },
        { "Quarter": "Q2", "Revenue": 1200000, "Profit": 180000 }
      ]
    }
  }
]
msg.tableCount = 1

// Process in Function node
const quarterlyData = msg.tables[0].table.rows;
const totalRevenue = quarterlyData.reduce((sum, q) => sum + q.Revenue, 0);
msg.totalRevenue = totalRevenue; // 2200000

Example 2: Loop Through Multiple Tables

// After Extract Tables
msg.tables.forEach((table, index) => {
  console.log(`Table ${index + 1} on page ${table.page}`);

  if (table.table) { // JSON format
    console.log(`Columns: ${table.table.columns.join(', ')}`);
    console.log(`Rows: ${table.table.rows.length}`);
  }
});

Example 3: Insert Table Data into Database

// Extract Tables (JSON) → Function → Database Insert

// Function node
const table = msg.tables[0].table;
const insertData = table.rows.map(row => ({
  ...row,
  source: msg.filePath,
  extractedAt: new Date().toISOString()
}));

msg.data = insertData;
// Connect to Database Insert node

Example 4: HTML Output for Display

// Extract Tables (HTML format)
msg.tables = [
  {
    "index": 0,
    "page": 1,
    "content": "<table><tr><th>Name</th><th>Value</th></tr><tr><td>Item 1</td><td>100</td></tr></table>"
  }
]

// Use in email or web report
const htmlReport = `
<h1>Extracted Tables Report</h1>
${msg.tables.map(t => `
<h2>Table ${t.index + 1} (Page ${t.page})</h2>
${t.content}
`).join('')}
`;
msg.htmlReport = htmlReport;

Example 5: Filter Tables by Content

// After Extract Tables (JSON format)

// Find tables containing specific column
const financialTables = msg.tables.filter(table => {
  if (!table.table) return false;
  return table.table.columns.some(col =>
    col.toLowerCase().includes('revenue') ||
    col.toLowerCase().includes('profit')
  );
});

msg.financialTables = financialTables;

Example 6: Convert Table to CSV

// After Extract Tables (JSON format)
const table = msg.tables[0].table;

// Quote values that contain commas, quotes, or line breaks; null becomes an empty field
const escapeCsv = value => {
  if (value === null || value === undefined) return '';
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
};

// Header line first, then one line per row
const csv = [
  table.columns.map(escapeCsv).join(','),
  ...table.rows.map(row =>
    table.columns.map(col => escapeCsv(row[col])).join(',')
  )
].join('\n');

msg.csv = csv;

Tips for Effective Use

  1. Choose the Right Output Format:

    • Use JSON for programmatic processing, database inserts, or data analysis
    • Use HTML for email reports, web display, or when formatting matters
    • Use Text for simple viewing or when you need plain text representation
  2. Best Document Types for Table Extraction:

    • PDF - Works well for native PDF tables (not scanned images)
    • DOCX - Excellent for Microsoft Word tables
    • HTML - Perfect for web-based tables
    • XLSX - Each sheet is treated as a table
    • Scanned PDFs may require OCR preprocessing
  3. Table Detection:

    • The node detects tables based on document structure
    • Tables created with spaces/tabs may not be detected (they're just formatted text)
    • Complex merged cells or nested tables may have variable results
  4. Page Numbers:

    • Page numbers are available for formats that support them (PDF, DOCX)
    • Some formats may not provide reliable page numbers
  5. Data Type Conversion (JSON format):

    • Numbers are automatically converted from strings
    • Dates remain as strings (convert in a Function node if needed)
    • Empty cells become null
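Because dates stay as strings, a Function node can convert them after extraction. The sketch below is one possible approach under stated assumptions: `parseDateColumn` is a hypothetical helper, it only recognizes `YYYY-MM-DD` strings, and it interprets them as UTC midnight. Other date layouts would need their own parsing.

```javascript
// Hypothetical helper: convert YYYY-MM-DD strings in one column to Date objects.
// Values that do not match (including null) are left unchanged.
function parseDateColumn(rows, column) {
  const isoDate = /^\d{4}-\d{2}-\d{2}$/;
  return rows.map(row => {
    const value = row[column];
    if (typeof value !== 'string' || !isoDate.test(value)) return row;
    const date = new Date(`${value}T00:00:00Z`);
    return Number.isNaN(date.getTime()) ? row : { ...row, [column]: date };
  });
}

const rows = parseDateColumn(
  [
    { Invoice: 'A-1', Due: '2024-03-15' },
    { Invoice: 'A-2', Due: null },
  ],
  'Due'
);
```

The helper returns new row objects for converted rows instead of mutating the input, so the original extracted data remains available for comparison.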

Common Errors and Solutions

Error: "File path is required"

Cause: No file path was provided to the node.

Solution: Ensure the File Path input is connected or set to a valid path.

Error: "File does not exist"

Cause: The specified file path is invalid or the file doesn't exist.

Solution:

  • Verify the file path is correct
  • Check that the file exists at that location
  • Use absolute paths to avoid ambiguity

Error: "Failed to extract tables"

Cause: Document is corrupted, password-protected, or has parsing issues.

Solution:

  • Verify the file opens correctly in its native application
  • Remove password protection if present
  • Try a different document format

Warning: Table Count is 0

Cause: No tables were detected in the document.

Solution:

  • Verify the document actually contains tables
  • Check if tables are images (OCR preprocessing needed)
  • Ensure tables are properly structured (not just text aligned with spaces)

Issue: Incorrect Table Structure in JSON

Cause: Complex table with merged cells or unusual formatting.

Solution:

  • Simplify table structure in the source document
  • Use HTML format to inspect the raw table structure
  • Consider manual post-processing for complex tables

Integration with Other Nodes

Extract and Store in Database

Extract Tables (JSON) → Loop (tables) → Function (format) → Database Insert

Create Excel Report from Multiple PDFs

Loop (PDF files) → Extract Tables → Append to Array → Excel: Write Sheet

Table Data for AI Analysis

Extract Tables (JSON) → Function (convert to text) → OpenAI Chat → Analyze Data

Email Tables Report

Extract Tables (HTML) → Function (create email body) → Mail: Send Email
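For the "Function (convert to text)" step in the AI analysis flow above, one option is to render each JSON table as pipe-delimited text. This is only a sketch: `tableToMarkdown` is a hypothetical helper, and the Markdown-style layout is an assumption about what is convenient to place in a prompt, not a requirement of any node.

```javascript
// Hypothetical helper: render a JSON-format table as Markdown-style text.
function tableToMarkdown(table) {
  // Empty cells become empty strings; pipes are escaped so they do not split a cell.
  const cell = v => (v === null || v === undefined ? '' : String(v).replace(/\|/g, '\\|'));
  const line = cells => `| ${cells.join(' | ')} |`;
  return [
    line(table.columns.map(cell)),
    line(table.columns.map(() => '---')),
    ...table.rows.map(row => line(table.columns.map(col => cell(row[col])))),
  ].join('\n');
}

const text = tableToMarkdown({
  columns: ['Quarter', 'Revenue'],
  rows: [{ Quarter: 'Q1', Revenue: 1000000 }],
});
```

In a Function node, the result could be assigned to a message property (for example `msg.prompt`, a hypothetical name) before the chat node.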

Performance Considerations

  • Speed: Table extraction requires full document parsing, similar to Read Document
  • Complex Tables: Nested or merged cell tables take longer to process
  • Large Documents: Documents with many tables may require more processing time
  • Format Impact: XLSX and DOCX are generally faster than PDF for table extraction

Best Practices

  1. Validate Table Count: Always check tableCount before processing to ensure tables were found
  2. Handle Empty Tables: Some tables may have no data rows; check rows.length when using JSON format
  3. Column Name Consistency: Column names come from the first row; ensure your source documents have proper headers
  4. Error Handling: Use Try-Catch nodes when processing multiple documents
  5. Format Selection: Choose JSON for automation, HTML for human-readable output
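The first two practices can be combined into a small guard placed in a Function node after Extract Tables. This is a sketch, assuming the output is exposed as `msg.tables` and `msg.tableCount` as in the usage examples; `usableTables` is a hypothetical helper name.

```javascript
// Hypothetical guard: keep only JSON-format tables that have at least one data row.
function usableTables(msg) {
  // No tables found, or the output is missing entirely.
  if (!msg.tableCount || !Array.isArray(msg.tables)) return [];
  return msg.tables.filter(t =>
    t.table && Array.isArray(t.table.rows) && t.table.rows.length > 0
  );
}

const sampleMsg = {
  tableCount: 2,
  tables: [
    { index: 0, table: { columns: ['A'], rows: [{ A: 1 }] } },
    { index: 1, table: { columns: ['A'], rows: [] } }, // header only, no data rows
  ],
};
const ready = usableTables(sampleMsg);
```

With the sample message, only the first table is kept because the second has no data rows. Downstream nodes can then branch on whether the returned array is empty.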