Extract Tables

Extract tables from documents (PDF, DOCX, HTML, etc.) with support for multiple output formats. Get structured table data ready for processing, analysis, or database insertion.

Common Properties

Name - The custom name of the node.
Color - The custom color of the node.
Delay Before (sec) - Time to wait, in seconds, before executing the node.
Delay After (sec) - Time to wait, in seconds, after executing the node.
Continue On Error - Automation will continue regardless of any error. The default value is false.

Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

File Path - Path to the document file. Works best with PDF, DOCX, HTML, and XLSX files.

Options

Output Format - Format for the extracted tables. Default is "HTML (Structured)".
- HTML (Structured) - Preserves visual structure with HTML table tags. Best for rendering or maintaining formatting.
- Text (Simple) - Plain text representation. Good for basic display or text analysis.
- JSON (Data) - Structured data with columns and rows. Best for programmatic processing and database insertion.

Output

Tables - Array of extracted tables. Each table includes:
- index - Position of the table in the document (0-based)
- page - Page number where the table appears (if available)
- content - Table content in the selected format (HTML or Text)
- table - (JSON format only) Structured table object with columns and rows
- html - (when available) Original HTML representation
Table Count - Number of tables found in the document.

JSON Format Structure

When using JSON output format, each table follows this structure:

{
  "index": 0,
  "page": 2,
  "table": {
    "columns": ["Name", "Age", "Location"],
    "rows": [
      { "Name": "John", "Age": 54, "Location": "Boston" },
      { "Name": "Jane", "Age": 47, "Location": "London" }
    ]
  }
}

Notes on JSON format:

- Numeric values are automatically converted to numbers
- Column names come from the first row of the table
- Each row is an object with column names as keys
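
The notes above can be illustrated with a small sketch. This is not the node's actual implementation: `coerceCell` and `toStructuredTable` are hypothetical helpers, and the sketch assumes the raw table arrives as an array of string arrays whose first row is the header. The node's real rules for recognizing numbers (for example, thousands separators or currency symbols) may differ.

```javascript
// Hypothetical helper: turn a cell string into a number when it parses as one.
// Empty cells become null; everything else stays a trimmed string.
function coerceCell(text) {
  const trimmed = text.trim();
  if (trimmed === "") return null;
  const asNumber = Number(trimmed);
  return Number.isNaN(asNumber) ? trimmed : asNumber;
}

// Hypothetical helper: the first row supplies column names,
// and every remaining row becomes an object keyed by those names.
function toStructuredTable(rawRows) {
  const [header = [], ...body] = rawRows;
  const columns = header.map(h => h.trim());
  const rows = body.map(cells =>
    Object.fromEntries(columns.map((col, i) => [col, coerceCell(cells[i] ?? "")]))
  );
  return { columns, rows };
}

const structured = toStructuredTable([
  ["Name", "Age", "Location"],
  ["John", "54", "Boston"],
  ["Jane", "47", "London"]
]);
```

Here `structured.rows[0].Age` is the number 54 rather than the string "54", which mirrors the structure shown above.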

How It Works

The Extract Tables node scans documents for table elements and extracts them in your preferred format. When executed, the node:

- Validates the provided file path
- Parses the entire document
- Identifies all table elements
- Extracts table content with metadata (page numbers, position)
- Converts tables to the selected output format
- For JSON format, parses HTML tables into structured columns and rows
- Returns array of tables with count

Requirements

- Valid file path to a document containing tables
- Read access to the file
- Document with properly structured tables (best results with native tables, not text formatted to look like tables)

Error Handling

The node will return specific errors in the following cases:

- Empty or missing file path
- File not found at the specified path
- Unsupported document format
- Corrupted or invalid document file
- Insufficient permissions to read the file

Usage Examples

Example 1: Extract Tables as JSON for Processing

// Input
msg.filePath = "/reports/quarterly-results.pdf"

// Extract Tables (JSON format) output
msg.tables = [
  {
    "index": 0,
    "page": 3,
    "table": {
      "columns": ["Quarter", "Revenue", "Profit"],
      "rows": [
        { "Quarter": "Q1", "Revenue": 1000000, "Profit": 150000 },
        { "Quarter": "Q2", "Revenue": 1200000, "Profit": 180000 }
      ]
    }
  }
]
msg.tableCount = 1

// Process in Function node
const quarterlyData = msg.tables[0].table.rows;
const totalRevenue = quarterlyData.reduce((sum, q) => sum + q.Revenue, 0);
msg.totalRevenue = totalRevenue; // 2200000

Example 2: Loop Through Multiple Tables

// After Extract Tables
msg.tables.forEach((table, index) => {
  console.log(`Table ${index + 1} on page ${table.page}`);

  if (table.table) { // JSON format
    console.log(`Columns: ${table.table.columns.join(', ')}`);
    console.log(`Rows: ${table.table.rows.length}`);
  }
});

Example 3: Insert Table Data into Database

// Extract Tables (JSON) → Function → Database Insert

// Function node
const table = msg.tables[0].table;
const insertData = table.rows.map(row => ({
  ...row,
  source: msg.filePath,
  extractedAt: new Date().toISOString()
}));

msg.data = insertData;
// Connect to Database Insert node

Example 4: HTML Output for Display

// Extract Tables (HTML format)
msg.tables = [
  {
    "index": 0,
    "page": 1,
    "content": "<table><tr><th>Name</th><th>Value</th></tr><tr><td>Item 1</td><td>100</td></tr></table>"
  }
]

// Use in email or web report
const htmlReport = `
  <h1>Extracted Tables Report</h1>
  ${msg.tables.map(t => `
    <h2>Table ${t.index + 1} (Page ${t.page})</h2>
    ${t.content}
  `).join('')}
`;
msg.htmlReport = htmlReport;

Example 5: Filter Tables by Content

// After Extract Tables (JSON format)

// Find tables containing specific column
const financialTables = msg.tables.filter(table => {
  if (!table.table) return false;
  return table.table.columns.some(col =>
    col.toLowerCase().includes('revenue') ||
    col.toLowerCase().includes('profit')
  );
});

msg.financialTables = financialTables;

Example 6: Convert Table to CSV

// After Extract Tables (JSON format)
const table = msg.tables[0].table;

// Quote a value if it contains a comma, a quote, or a line break
const escapeCsv = value => {
  if (value === null || value === undefined) return ''; // empty cell
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
};

// Header line first, then one line per data row
const csv = [
  table.columns.map(escapeCsv).join(','),
  ...table.rows.map(row =>
    table.columns.map(col => escapeCsv(row[col])).join(',')
  )
].join('\n');

msg.csv = csv;

Tips for Effective Use

Choose the Right Output Format:
- Use JSON for programmatic processing, database inserts, or data analysis
- Use HTML for email reports, web display, or when formatting matters
- Use Text for simple viewing or when you need plain text representation
Best Document Types for Table Extraction:
- PDF - Works well for native PDF tables (not scanned images)
- DOCX - Excellent for Microsoft Word tables
- HTML - Perfect for web-based tables
- XLSX - Each sheet is treated as a table
- Scanned PDFs may require OCR preprocessing
Table Detection:
- The node detects tables based on document structure
- Tables created with spaces/tabs may not be detected (they're just formatted text)
- Complex merged cells or nested tables may have variable results
Page Numbers:
- Page numbers are available for formats that support them (PDF, DOCX)
- Some formats may not provide reliable page numbers
Data Type Conversion (JSON format):
- Numbers are automatically converted from strings
- Dates remain as strings (convert in a Function node if needed)
- Empty cells become null
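
As a minimal sketch of the date conversion mentioned above, the snippet below parses date strings in chosen columns inside a Function node. `convertDateColumns` is a hypothetical helper, the column names (`Invoice`, `Issued`, `Amount`) are made-up sample data, and the sketch assumes the dates are ISO-style (`YYYY-MM-DD`). Other formats such as `15/03/2024` are not parsed reliably by the built-in `Date` constructor and would need custom handling.

```javascript
// Hypothetical helper: convert string values in the listed columns to Date objects.
// Values that are null (empty cells), non-strings, or unparseable are left unchanged.
function convertDateColumns(rows, dateColumns) {
  return rows.map(row => {
    const converted = { ...row };
    for (const col of dateColumns) {
      const value = row[col];
      if (typeof value !== "string") continue;
      const parsed = new Date(value);
      if (!Number.isNaN(parsed.getTime())) converted[col] = parsed;
    }
    return converted;
  });
}

// Sample rows shaped like msg.tables[0].table.rows (illustrative data only)
const sampleRows = [
  { Invoice: "A-1", Issued: "2024-03-15", Amount: 120.5 },
  { Invoice: "A-2", Issued: null, Amount: null }
];
const withDates = convertDateColumns(sampleRows, ["Issued"]);
```

The helper returns new row objects instead of mutating the input, so the original `msg.tables` data stays available to later nodes.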

Common Errors and Solutions

Error: "File path is required"

Cause: No file path was provided to the node.

Solution: Ensure the File Path input is connected or set with a valid path.

Error: "File does not exist"

Cause: The specified file path is invalid or the file doesn't exist.

Solution:
- Verify the file path is correct
- Check that the file exists at that location
- Use absolute paths to avoid ambiguity

Error: "Failed to extract tables"

Cause: Document is corrupted, password-protected, or has parsing issues.

Solution:
- Verify the file opens correctly in its native application
- Remove password protection if present
- Try a different document format

Warning: Table Count is 0

Cause: No tables were detected in the document.

Solution:
- Verify the document actually contains tables
- Check if tables are images (OCR preprocessing needed)
- Ensure tables are properly structured (not just text aligned with spaces)

Issue: Incorrect Table Structure in JSON

Cause: Complex table with merged cells or unusual formatting.

Solution:
- Simplify table structure in the source document
- Use HTML format to inspect the raw table structure
- Consider manual post-processing for complex tables

Integration with Other Nodes

Extract and Store in Database

Extract Tables (JSON) → Loop (tables) → Function (format) → Database Insert

Create Excel Report from Multiple PDFs

Loop (PDF files) → Extract Tables → Append to Array → Excel: Write Sheet

Table Data for AI Analysis

Extract Tables (JSON) → Function (convert to text) → OpenAI Chat → Analyze Data
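
The "convert to text" Function step in this flow could look roughly like the sketch below. `tableToText` is a hypothetical helper and the pipe-delimited layout is only one reasonable choice. The prompt wording and the `promptText` variable are likewise assumptions, not something the chat node requires.

```javascript
// Hypothetical helper: render a JSON-format table as pipe-delimited text.
// Null or missing cells are written as empty strings.
function tableToText(table) {
  const header = table.columns.join(" | ");
  const lines = table.rows.map(row =>
    table.columns
      .map(col => (row[col] === null || row[col] === undefined ? "" : String(row[col])))
      .join(" | ")
  );
  return [header, ...lines].join("\n");
}

// Sample table shaped like msg.tables[0].table (illustrative data only)
const sampleTable = {
  columns: ["Quarter", "Revenue", "Profit"],
  rows: [
    { Quarter: "Q1", Revenue: 1000000, Profit: 150000 },
    { Quarter: "Q2", Revenue: 1200000, Profit: 180000 }
  ]
};

const promptText = `Summarize the trend in this table:\n${tableToText(sampleTable)}`;
```

In a real flow, `sampleTable` would be replaced by `msg.tables[0].table`, and `promptText` would be assigned to whichever message property the chat node reads.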

Email Tables Report

Extract Tables (HTML) → Function (create email body) → Mail: Send Email

Performance Considerations

- Speed: Table extraction requires full document parsing, similar to Read Document
- Complex Tables: Nested or merged cell tables take longer to process
- Large Documents: Documents with many tables may require more processing time
- Format Impact: XLSX and DOCX are generally faster than PDF for table extraction

Best Practices

- Validate Table Count: Always check tableCount before processing to ensure tables were found
- Handle Empty Tables: Some tables may have no data rows, check rows.length in JSON format
- Column Name Consistency: Column names come from the first row; ensure your source documents have proper headers
- Error Handling: Use Try-Catch nodes when processing multiple documents
- Format Selection: Choose JSON for automation, HTML for human-readable output
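
The first two practices can be combined into a single guard at the top of a Function node. The sketch below assumes JSON output format. `usableTables` is a hypothetical helper, and `sampleMsg` stands in for the message produced by Extract Tables.

```javascript
// Hypothetical helper: return only JSON-format tables that have at least one data row.
// Returns an empty array when no tables were found.
function usableTables(msg) {
  if (!msg.tableCount || !Array.isArray(msg.tables)) return [];
  return msg.tables.filter(t => t.table && t.table.rows.length > 0);
}

// Sample message (illustrative data only): the second table has headers but no rows
const sampleMsg = {
  tableCount: 2,
  tables: [
    { index: 0, page: 1, table: { columns: ["Name"], rows: [{ Name: "Item 1" }] } },
    { index: 1, page: 2, table: { columns: ["Name"], rows: [] } }
  ]
};

const ready = usableTables(sampleMsg);
```

Downstream logic can then iterate over `ready` without re-checking each table, and an empty result can be routed to a separate branch for logging or alerts.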
