Extract Text

Extracts text content from a PDF document.

Common Properties

Name - The custom name of the node.
Color - The custom color of the node.
Delay Before (sec) - Number of seconds to wait before executing the node.
Delay After (sec) - Number of seconds to wait after executing the node.
Continue On Error - Automation will continue regardless of any error. The default value is false.

Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

In PDF File Path - The path to the PDF file from which text will be extracted.

Out Text File Path - Optional path to save the extracted text as a text file.
Password - The password for the PDF document if it is encrypted.
X - X coordinate of the rectangular area to extract text from.
Y - Y coordinate of the rectangular area to extract text from.
Width - Width of the rectangular area to extract text from.
Height - Height of the rectangular area to extract text from.
Page Index - The zero-based index of the page to extract text from. If left blank, all pages are processed.

The Extract Text node reads text content from a PDF document. When executed, the node:

1. Validates that the input PDF path is not empty.
2. Loads the PDF document (using the password if provided).
3. If rectangle coordinates (X, Y, Width, Height) are provided:
   - Extracts text only from the specified rectangular area.
   - Uses the Page Index if provided; otherwise uses the first page.
4. If no rectangle coordinates are provided:
   - Extracts text from the entire document or the specified page.
5. If an output text file path is provided, saves the extracted text to that file.
6. Returns the extracted text as output.
7. Closes the document.
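
The sequence above can be sketched in Python to make the branching easier to follow. This is only an illustration under stated assumptions, not the node's actual implementation: the Word and Document types are a hypothetical in-memory stand-in for a loaded PDF, and password handling and file loading are left out.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple


@dataclass
class Word:
    """A piece of text with the position of its anchor point on the page."""
    text: str
    x: float
    y: float


# Hypothetical stand-in for a loaded PDF: one list of words per page.
Document = List[List[Word]]
Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def words_in_rect(page: List[Word], rect: Rect) -> List[Word]:
    """Keep the words whose anchor point falls inside the rectangle."""
    x, y, width, height = rect
    return [w for w in page if x <= w.x <= x + width and y <= w.y <= y + height]


def extract_text(
    pdf_path: str,
    document: Document,
    rect: Optional[Rect] = None,
    page_index: Optional[int] = None,
    out_text_path: Optional[str] = None,
) -> str:
    # The input path must not be empty.
    if not pdf_path:
        raise ValueError("PDF file path must not be empty")

    if rect is not None:
        # Area extraction works on one page, defaulting to the first page.
        page = document[page_index if page_index is not None else 0]
        pages_text = [" ".join(w.text for w in words_in_rect(page, rect))]
    else:
        # Whole-page extraction: the given page, or every page if none is given.
        pages = document if page_index is None else [document[page_index]]
        pages_text = [" ".join(w.text for w in page) for page in pages]

    text = "\n".join(pages_text)

    # Optionally save the extracted text to a file.
    if out_text_path:
        Path(out_text_path).write_text(text, encoding="utf-8")

    return text


doc = [
    [Word("Invoice", 50, 40), Word("Total:", 50, 700), Word("120.00", 120, 700)],
    [Word("Terms", 50, 40)],
]
print(extract_text("invoice.pdf", doc, rect=(40, 690, 200, 20)))  # Total: 120.00
print(extract_text("invoice.pdf", doc, page_index=1))             # Terms
```

In this sketch the rectangle is matched against a single anchor point per word; a real PDF library may decide differently how to treat text that only partly overlaps the area.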

The node will return specific errors in the following cases:

When extracting text from a specific area, all four coordinates (X, Y, Width, Height) must be provided.
The Page Index is required when extracting text from a specific area.
The coordinate system starts at (0, 0) in the top-left corner of the page.
X increases towards the right side of the page.
Y increases towards the bottom of the page.
If no Page Index is specified, the node processes all pages of the document.
The extracted text preserves the original formatting as much as possible.
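
Rectangle values measured in another tool may use the default PDF user space, whose origin is the bottom-left corner with Y increasing upward. In that case the Y value needs to be flipped before it matches the top-left convention described above. The following is a minimal sketch of that conversion; to_top_left is a hypothetical helper, not part of the node, and it assumes the page height is known and given in the same units as the rectangle.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def to_top_left(rect_bottom_left: Rect, page_height: float) -> Rect:
    """Convert a rectangle from a bottom-left-origin system (Y grows upward)
    to a top-left-origin system (Y grows downward)."""
    x, y, width, height = rect_bottom_left
    # The rectangle's top edge sits at y + height when measured from the
    # bottom of the page, so its distance from the top of the page is
    # page_height - (y + height). X, width and height are unchanged.
    return (x, page_height - (y + height), width, height)


# Assumed example: a page 792 units high, with a 200 x 20 area whose bottom
# edge is 72 units above the bottom of the page.
print(to_top_left((50, 72, 200, 20), page_height=792))  # (50, 700, 200, 20)
```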