Pdf To Data Table
Extracts tables from a PDF file and converts them to data table format.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- PDF File Path - The file path to the PDF file containing tables to be extracted.
Options
- Pages - Specifies which pages to extract tables from. Accepted values include:
  - "all" - Extract tables from all pages
  - "1-2,3" - Extract tables from the range 1-2 plus page 3 (pages 1, 2, and 3)
  - "[1,2]" - Extract tables from pages 1 and 2
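The three Pages formats above can be illustrated with a small normalizer. This is only a sketch: `parse_pages` is a hypothetical helper, and the node's actual parsing logic is not documented here, so details such as how whitespace or duplicate pages are handled are assumptions.

```python
import json


def parse_pages(spec: str):
    """Normalize a Pages value into "all" or a sorted list of page numbers.

    Hypothetical helper for illustration; not the node's real implementation.
    """
    text = spec.strip()
    if text.lower() == "all":
        return "all"

    if text.startswith("["):
        # List form such as "[1,2]"
        try:
            pages = json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid page specification: {spec!r}") from exc
        valid = (
            isinstance(pages, list)
            and len(pages) > 0
            and all(type(p) is int and p >= 1 for p in pages)
        )
        if not valid:
            raise ValueError(f"Invalid page specification: {spec!r}")
        return sorted(set(pages))

    # Range form such as "1-2,3"
    pages = set()
    for part in text.split(","):
        start, sep, end = part.strip().partition("-")
        try:
            first = int(start)
            last = int(end) if sep else first
        except ValueError as exc:
            raise ValueError(f"Invalid page specification: {spec!r}") from exc
        if first < 1 or last < first:
            raise ValueError(f"Invalid page specification: {spec!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)
```

With this sketch, `parse_pages("1-2,3")` and `parse_pages("[1,2]")` both yield plain lists of page numbers, while a malformed value such as `"3-1"` raises a `ValueError`.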
Output
- Table List - A list of data tables extracted from the PDF file.
How It Works
When executed, the node:
- Validates that the PDF file path is not empty
- Checks if the PDF file exists at the specified path
- Sets the TABULA_JAR environment variable for table extraction
- Reads the PDF file and extracts tables using the tabula library
- Converts each extracted table to a dictionary format with columns and rows
- Replaces any NaN values with empty strings
- Returns a list of all extracted tables
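The conversion steps above (dictionary format, NaN replacement, first row as headers) can be sketched without the extraction library. The example below assumes each extracted table arrives as a list of rows and that the data table dictionary has a `columns` list and a `rows` list of per-row dictionaries. Both the `to_data_table` helper and that exact shape are assumptions made for illustration.

```python
import math


def is_nan(value) -> bool:
    """Return True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)


def to_data_table(raw_rows):
    """Convert raw rows (first row = headers) into a columns/rows dictionary.

    Hypothetical helper; the node's real output shape may differ.
    """
    if not raw_rows:
        return {"columns": [], "rows": []}

    # The first row supplies the column headers.
    columns = ["" if is_nan(c) else str(c) for c in raw_rows[0]]

    rows = []
    for raw in raw_rows[1:]:
        # Replace NaN cells with empty strings, keep other values as-is.
        rows.append(
            {col: ("" if is_nan(cell) else cell) for col, cell in zip(columns, raw)}
        )
    return {"columns": columns, "rows": rows}


extracted = [["Name", "Qty"], ["Bolt", float("nan")], ["Nut", 4]]
table_list = [to_data_table(extracted)]
```

Here `table_list` stands in for the node's Table List output: one dictionary per extracted table, with the missing quantity for "Bolt" stored as an empty string.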
Requirements
- A valid PDF file path
- The PDF file must exist at the specified location
- Proper file system permissions to read the file
- The tabula-1.0.5-jar-with-dependencies.jar file must be available
Error Handling
The node will return specific errors in the following cases:
- Empty PDF file path
- PDF file does not exist at the specified path
- No tables found in the PDF file
- Invalid page specification
- Insufficient permissions to read the file
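The path-related checks in this list can be approximated as follows. This is a minimal sketch: `check_pdf_path` is a hypothetical name, and the exception types and messages are assumptions, since the node's actual error codes are not listed here.

```python
import os


def check_pdf_path(path: str) -> None:
    """Raise an error if the PDF path is empty, missing, or unreadable.

    Hypothetical helper; error types and messages are illustrative only.
    """
    if not path or not path.strip():
        raise ValueError("PDF file path cannot be empty")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"PDF file does not exist: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"Insufficient permissions to read: {path}")
```

Running a check like this before extraction lets a flow fail early with a clear message instead of surfacing a lower-level error from the extraction step.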
Usage Notes
- The output is a list of tables, as a single PDF file may contain multiple tables
- The Pages option allows you to specify which pages to extract tables from
- Empty cells, which the extraction reports as NaN values, are automatically converted to empty strings
- The node uses the tabula library for PDF table extraction
- The first row of each extracted table is used as column headers