Extract Text
Extracts text content from a PDF document.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - The number of seconds to wait before executing the node.
- Delay After (sec) - The number of seconds to wait after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
Note: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- In PDF File Path - The path to the PDF file from which text will be extracted.
Options
- Out Text File Path - Optional path to save the extracted text as a text file.
- Password - The password for the PDF document if it is encrypted.
- X - X coordinate of the rectangular area to extract text from.
- Y - Y coordinate of the rectangular area to extract text from.
- Width - Width of the rectangular area to extract text from.
- Height - Height of the rectangular area to extract text from.
- Page Index - The zero-based index of the page to extract text from. If left blank and no rectangular area is specified, all pages are processed.
Output
- Text - The extracted text content from the PDF.
How It Works
The Extract Text node reads text content from a PDF document. When executed, the node:
- Validates that the input PDF path is not empty
- Loads the PDF document (using the password if provided)
- If rectangle coordinates (X, Y, Width, Height) are provided:
  - Extracts text only from the specified rectangular area
  - Uses the Page Index if provided, otherwise uses the first page
- If no rectangle coordinates are provided:
  - Extracts text from the entire document or specified page
- If an output text file path is provided, saves the extracted text to that file
- Returns the extracted text as output
- Closes the document
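
The branching in the steps above can be sketched in a few lines of Python. This is an illustrative model only, not the node's implementation: the `Word`, `Page`, and `extract_text` names are hypothetical, a page is assumed to be a list of positioned words measured from the top-left corner, and password handling and saving to a text file are left out. The fallback to the first page follows the step list above; since the Usage Notes state that a Page Index is required for area extraction, passing one explicitly is the safer choice.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    x: float  # distance from the left edge of the page (hypothetical model)
    y: float  # distance from the top edge of the page (hypothetical model)
    text: str


Page = List[Word]
Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def words_in_rect(page: Page, rect: Rect) -> List[Word]:
    """Keep the words whose anchor point falls inside the rectangle."""
    x, y, width, height = rect
    return [w for w in page if x <= w.x <= x + width and y <= w.y <= y + height]


def extract_text(
    pages: List[Page],
    rect: Optional[Rect] = None,
    page_index: Optional[int] = None,
) -> str:
    if rect is not None:
        # Area extraction: one page only, falling back to the first page.
        page = pages[page_index if page_index is not None else 0]
        return " ".join(w.text for w in words_in_rect(page, rect))
    # Whole-page extraction: the given page, or every page when no index is set.
    selected = pages if page_index is None else [pages[page_index]]
    return "\n".join(" ".join(w.text for w in page) for page in selected)
```

With a two-page model such as `[[Word(10, 10, "Invoice"), Word(200, 10, "#1234")], [Word(10, 10, "Terms")]]`, calling `extract_text(doc)` joins both pages, while `extract_text(doc, rect=(0, 0, 100, 50), page_index=0)` keeps only the words inside the rectangle on the first page.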
Requirements
- A valid PDF file path
- Sufficient permissions to read the input file
- Valid output file path (if specified)
- Valid rectangle coordinates (if using area extraction)
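
If the paths come from an earlier step in a flow, it can help to check these requirements before the node runs. The following is a minimal sketch using only the Python standard library; the `preflight` function is a hypothetical helper, not part of the node, and it only checks that the file exists and is readable, not that it is a well-formed PDF.

```python
import os
from typing import List, Optional


def preflight(pdf_path: str, out_text_path: Optional[str] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []

    if not pdf_path:
        problems.append("PDF file path is empty")
    elif not os.path.isfile(pdf_path):
        problems.append(f"PDF file not found: {pdf_path}")
    elif not os.access(pdf_path, os.R_OK):
        problems.append(f"PDF file is not readable: {pdf_path}")

    if out_text_path:
        # The output file may not exist yet, so check its parent directory.
        out_dir = os.path.dirname(os.path.abspath(out_text_path))
        if not os.path.isdir(out_dir):
            problems.append(f"Output directory does not exist: {out_dir}")
        elif not os.access(out_dir, os.W_OK):
            problems.append(f"Output directory is not writable: {out_dir}")

    return problems
```

An empty result means the basic file requirements are met; the rectangle and page index values still need their own checks.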
Error Handling
The node will return specific errors in the following cases:
- Empty or invalid PDF File Path
- Invalid rectangle coordinates (negative values or non-numeric strings)
- Invalid page index (negative values or non-numeric strings)
- File I/O errors
- PDF processing errors
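
The rectangle and page index rules can also be checked before the values reach the node. The sketch below assumes the values arrive as strings, as the "non-numeric strings" cases above suggest. The function names and error messages are hypothetical and do not reproduce the node's actual error codes.

```python
import math
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def parse_coordinate(name: str, raw: str) -> float:
    """Parse one rectangle value, rejecting non-numeric or negative input."""
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{name} must be numeric, got {raw!r}") from None
    if not math.isfinite(value) or value < 0:
        raise ValueError(f"{name} must be a non-negative number, got {raw!r}")
    return value


def parse_rect(x: str, y: str, width: str, height: str) -> Optional[Rect]:
    """Return None when no area is given; require all four values otherwise."""
    fields = {"X": x, "Y": y, "Width": width, "Height": height}
    filled = [name for name, raw in fields.items() if raw.strip() != ""]
    if not filled:
        return None
    if len(filled) != 4:
        missing = ", ".join(name for name in fields if name not in filled)
        raise ValueError(f"Area extraction needs all four values; missing: {missing}")
    return (
        parse_coordinate("X", x),
        parse_coordinate("Y", y),
        parse_coordinate("Width", width),
        parse_coordinate("Height", height),
    )


def parse_page_index(raw: str) -> Optional[int]:
    """Return None for a blank index; reject non-integer or negative input."""
    if raw.strip() == "":
        return None
    try:
        index = int(raw)
    except ValueError:
        raise ValueError(f"Page Index must be an integer, got {raw!r}") from None
    if index < 0:
        raise ValueError(f"Page Index must be zero or greater, got {raw!r}")
    return index
```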
Usage Notes
- When extracting text from a specific area, all four coordinates (X, Y, Width, Height) must be provided
- The Page Index is required when extracting text from a specific area
- The coordinate system has its origin (0,0) at the top-left corner of the page
- X increases towards the right side of the page
- Y increases towards the bottom of the page
- If no Page Index is specified and no rectangular area is given, the node processes all pages of the document
- The extracted text preserves the original formatting as much as possible
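
Many PDF tools report positions in the PDF's native coordinate space, where the origin is the bottom-left corner and Y increases upwards. A box taken from such a tool has to be flipped before it matches the top-left convention described above. The helper below is a hypothetical sketch of that conversion. It assumes that the node's X and Y refer to the rectangle's top-left corner, that the node uses the same units as the source tool (typically points, 1/72 inch), and that the page is not rotated; none of these are confirmed by this page.

```python
from typing import Tuple


def pdf_box_to_node_rect(
    llx: float, lly: float, urx: float, ury: float, page_height: float
) -> Tuple[float, float, float, float]:
    """Convert a bottom-left-origin box to (X, Y, Width, Height), top-left origin.

    (llx, lly) is the lower-left corner and (urx, ury) the upper-right corner
    of the box, both measured from the bottom-left corner of the page.
    """
    if urx < llx or ury < lly:
        raise ValueError("upper-right corner must not be below or left of lower-left")
    x = llx
    # The top edge of the box sits ury above the page bottom, which is
    # page_height - ury below the page top.
    y = page_height - ury
    return (x, y, urx - llx, ury - lly)


# Example: a 200 x 50 box whose top edge is 72 units below the top of a
# 792-unit-high page (US Letter in points).
print(pdf_box_to_node_rect(72, 670, 272, 720, page_height=792))
# -> (72, 72, 200, 50)
```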