Pdf To Data Table

Extracts tables from a PDF file and converts them to data table format.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • PDF File Path - The file path to the PDF file containing tables to be extracted.

Options

  • Pages - Specifies which pages to extract tables from. Accepted formats include:
    • "all" - Extract tables from all pages
    • "1-2,3" - Extract tables from pages 1, 2, and 3
    • "[1,2]" - Extract tables from pages 1 and 2
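
To make the three Pages formats concrete, here is a minimal Python sketch of how such a value could be normalized. The helper name parse_pages is hypothetical, and this page does not describe how the node itself parses the value, so treat the validation rules below as assumptions.

```python
import json


def parse_pages(spec):
    """Normalize a Pages value to "all" or a sorted list of 1-based page numbers.

    Hypothetical helper: the node's real parsing rules are not documented here.
    """
    text = str(spec).strip()
    if not text:
        raise ValueError("empty page specification")
    if text.lower() == "all":
        return "all"

    if text.startswith("["):
        # List form such as "[1,2]".
        try:
            pages = json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid page specification: {spec!r}") from exc
        if not isinstance(pages, list) or not all(type(p) is int for p in pages):
            raise ValueError(f"invalid page specification: {spec!r}")
    else:
        # Range form such as "1-2,3": comma-separated pages or start-end ranges.
        pages = []
        for part in text.split(","):
            start, sep, end = part.strip().partition("-")
            try:
                first = int(start)
                last = int(end) if sep else first
            except ValueError as exc:
                raise ValueError(f"invalid page specification: {spec!r}") from exc
            if first > last:
                raise ValueError(f"invalid page range: {part.strip()!r}")
            pages.extend(range(first, last + 1))

    if not pages or any(p < 1 for p in pages):
        raise ValueError(f"invalid page specification: {spec!r}")
    return sorted(set(pages))


print(parse_pages("all"))    # all
print(parse_pages("1-2,3"))  # [1, 2, 3]
print(parse_pages("[1,2]"))  # [1, 2]
```

A malformed value such as "3-1" or "abc" raises ValueError in this sketch, which corresponds to the "Invalid page specification" case listed under Error Handling.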

Output

  • Table List - A list of data tables extracted from the PDF file.

How It Works

The Pdf To Data Table node extracts tables from a PDF file and converts them to data table format. When executed, the node:

  1. Validates that the PDF file path is not empty
  2. Checks if the PDF file exists at the specified path
  3. Sets the TABULA_JAR environment variable for table extraction
  4. Reads the PDF file and extracts tables using the tabula library
  5. Converts each extracted table to a dictionary format with columns and rows
  6. Replaces any NaN values with empty strings
  7. Returns a list of all extracted tables
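
The steps above can be sketched in Python with the tabula-py package, which wraps the tabula JAR and requires a Java runtime. This is an illustration under stated assumptions, not the node's actual implementation: the function names, the jar_path parameter, and the exact shape of the output dictionary (a "columns" list plus a "rows" list of per-row dictionaries) are hypothetical.

```python
import math
import os


def table_to_dict(header, rows):
    """Convert one table to {"columns": [...], "rows": [...]} (assumed shape)."""

    def clean(value):
        # Empty cells come back as NaN; replace them with empty strings.
        if isinstance(value, float) and math.isnan(value):
            return ""
        return value

    columns = [str(name) for name in header]
    return {
        "columns": columns,
        "rows": [dict(zip(columns, (clean(v) for v in row))) for row in rows],
    }


def pdf_to_data_tables(pdf_path, pages="all", jar_path=None):
    # Steps 1 and 2: validate the path and check that the file exists.
    if not pdf_path:
        raise ValueError("PDF file path cannot be empty")
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"PDF file does not exist: {pdf_path}")

    # Step 3: point tabula at a specific JAR (hypothetical parameter).
    if jar_path:
        os.environ["TABULA_JAR"] = jar_path

    # Step 4: extract tables. Imported lazily because tabula-py needs Java.
    import tabula

    frames = tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)
    if not frames:
        raise ValueError("No tables found in the PDF file")

    # Steps 5 to 7: convert each DataFrame and return the list of tables.
    return [table_to_dict(list(df.columns), df.values.tolist()) for df in frames]
```

Calling pdf_to_data_tables("report.pdf", pages="1-2,3") with a hypothetical file name would return one dictionary per table found on those pages.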

Requirements

  • A valid PDF file path
  • The PDF file must exist at the specified location
  • Proper file system permissions to read the file
  • The tabula-1.0.5-jar-with-dependencies.jar file must be available

Error Handling

The node returns an error in each of the following cases:

  • Empty PDF file path
  • PDF file does not exist at the specified path
  • No tables found in the PDF file
  • Invalid page specification
  • Insufficient permissions to read the file

Usage Notes

  • The output is a list of tables, as a single PDF file may contain multiple tables
  • The Pages option allows you to specify which pages to extract tables from
  • Empty cells, which the extraction reports as NaN values, are automatically converted to empty strings
  • The node uses the tabula library for PDF table extraction
  • The first row of each extracted table is used as column headers
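
Because the first row becomes the column headers, a table without a real header row loses its first data row to the header. The short Python sketch below illustrates the effect on raw rows and how a flow might iterate over the resulting list. The helper promote_header, the sample data, and the "columns"/"rows" layout are assumptions made for illustration.

```python
def promote_header(raw_rows):
    """Use the first raw row as column names and the rest as data rows."""
    if not raw_rows:
        return {"columns": [], "rows": []}
    header, *body = raw_rows
    columns = [str(cell) for cell in header]
    return {"columns": columns, "rows": [dict(zip(columns, row)) for row in body]}


# Made-up sample standing in for one table extracted from a PDF.
raw = [["Item", "Qty"], ["Bolt", "4"], ["Nut", ""]]
tables = [promote_header(raw)]

for index, table in enumerate(tables):
    print(index, table["columns"], len(table["rows"]))
# prints: 0 ['Item', 'Qty'] 2
```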