Extract Text

Extracts text content from a PDF document.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • In PDF File Path - The path to the PDF file from which text will be extracted.

Options

  • Out Text File Path - Optional path to save the extracted text as a text file.
  • Password - The password for the PDF document if it is encrypted.
  • X - X coordinate of the rectangular area to extract text from.
  • Y - Y coordinate of the rectangular area to extract text from.
  • Width - Width of the rectangular area to extract text from.
  • Height - Height of the rectangular area to extract text from.
  • Page Index - The zero-based index of the page to extract text from. If left blank, all pages are processed.

Output

  • Text - The extracted text content from the PDF.

How It Works

The Extract Text node reads text content from a PDF document. When executed, the node:

  1. Validates that the input PDF path is not empty
  2. Loads the PDF document (using the password if provided)
  3. If rectangle coordinates (X, Y, Width, Height) are provided:
    • Extracts text only from the specified rectangular area
    • Uses the Page Index if provided, otherwise uses the first page
  4. If no rectangle coordinates are provided:
    • Extracts text from the specified page if a Page Index is provided, otherwise from the entire document
  5. If an output text file path is provided, saves the extracted text to that file
  6. Returns the extracted text as output
  7. Closes the document
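
The branching in steps 3 to 6 can be sketched in Python as follows. This is a minimal illustration, not the node's implementation: the `Word` and `Rect` types and the `extract_text` function are hypothetical, and a page is modelled as a plain list of positioned words instead of a real PDF object. Loading the file, applying the password, and closing the document (steps 1, 2, and 7) depend on the PDF library in use and are left out.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence


@dataclass
class Word:
    """Hypothetical stand-in for a positioned text fragment on a page."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


@dataclass
class Rect:
    """Hypothetical rectangle built from the X, Y, Width, Height options."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, word: Word) -> bool:
        # Simplification: only the word's anchor point is tested.
        return (self.x <= word.x <= self.x + self.width
                and self.y <= word.y <= self.y + self.height)


def extract_text(pages: Sequence[Sequence[Word]],
                 rect: Optional[Rect] = None,
                 page_index: Optional[int] = None,
                 out_path: Optional[str] = None) -> str:
    if rect is not None:
        # Step 3: area extraction works on one page, the first by default.
        index = page_index if page_index is not None else 0
        text = " ".join(w.text for w in pages[index] if rect.contains(w))
    else:
        # Step 4: one page if an index is given, otherwise every page.
        selected = pages if page_index is None else [pages[page_index]]
        text = "\n".join(" ".join(w.text for w in page) for page in selected)

    if out_path is not None:
        # Step 5: optionally save the result as a text file.
        Path(out_path).write_text(text, encoding="utf-8")

    # Step 6: return the extracted text.
    return text
```

For example, with a two-page model, `extract_text(pages)` joins the text of both pages, while `extract_text(pages, rect=Rect(0, 0, 100, 50))` returns only the words whose anchor point lies inside that rectangle on the first page.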

Requirements

  • A valid PDF file path
  • Sufficient permissions to read the input file
  • Valid output file path (if specified)
  • Valid rectangle coordinates (if using area extraction)

Error Handling

The node returns an error in the following cases:

  • Empty or invalid PDF File Path
  • Invalid rectangle coordinates (negative values or non-numeric strings)
  • Invalid page index (negative values or non-numeric strings)
  • File I/O errors
  • PDF processing errors
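
If the option values arrive as strings from earlier nodes, it can help to check them before they reach Extract Text. The sketch below shows one way to apply the rules listed here (no negative values, no non-numeric strings) together with the requirement that all four rectangle values are supplied together. The function names are hypothetical and the error messages are illustrative; they are not the messages the node itself produces.

```python
import math
from typing import Optional, Tuple


def parse_non_negative(name: str, raw: Optional[str]) -> Optional[float]:
    """Return None for a blank value, otherwise a finite number >= 0."""
    if raw is None or raw.strip() == "":
        return None
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{name} must be numeric, got {raw!r}") from None
    if not math.isfinite(value) or value < 0:
        raise ValueError(f"{name} must be a non-negative number, got {raw!r}")
    return value


def parse_page_index(raw: Optional[str]) -> Optional[int]:
    """Return None for a blank value, otherwise a zero-based index >= 0."""
    if raw is None or raw.strip() == "":
        return None
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"Page Index must be an integer, got {raw!r}") from None
    if value < 0:
        raise ValueError(f"Page Index must not be negative, got {raw!r}")
    return value


def parse_area(x: Optional[str], y: Optional[str],
               width: Optional[str], height: Optional[str]
               ) -> Optional[Tuple[float, float, float, float]]:
    """Return None if no rectangle is given, or all four parsed values."""
    names = ("X", "Y", "Width", "Height")
    values = [parse_non_negative(n, r) for n, r in zip(names, (x, y, width, height))]
    if all(v is None for v in values):
        return None
    if any(v is None for v in values):
        raise ValueError("X, Y, Width, and Height must all be provided together")
    return (values[0], values[1], values[2], values[3])
```

Rejecting bad values early keeps the failure close to the step that produced them, which is usually easier to diagnose than an error raised inside the node.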

Usage Notes

  • When extracting text from a specific area, all four coordinates (X, Y, Width, Height) must be provided
  • Set the Page Index when extracting text from a specific area; if it is left blank, the first page is used
  • Coordinate system starts at (0,0) in the top-left corner of the page
  • X increases towards the right side of the page
  • Y increases towards the bottom of the page
  • If no area is given and no Page Index is specified, the node processes all pages of the document
  • The extracted text preserves the original formatting as much as possible
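
Rectangle values measured in another tool may need converting. The PDF format's default coordinate space places the origin at the bottom-left corner with Y increasing upwards, which is the opposite of the top-left convention described above. The sketch below converts a top-left rectangle into a bottom-left box, or the reverse direction can be derived from the same arithmetic. The function name is hypothetical, the page height must be known from elsewhere, and the source does not state which unit the node expects, so the example simply assumes both systems use the same unit.

```python
from typing import Tuple


def top_left_rect_to_bottom_left_box(x: float, y: float,
                                     width: float, height: float,
                                     page_height: float
                                     ) -> Tuple[float, float, float, float]:
    """Convert a top-left-origin rectangle to (left, bottom, right, top)
    in a bottom-left-origin coordinate space of the same unit."""
    left = x
    right = x + width
    top = page_height - y                 # upper edge, measured from the bottom
    bottom = page_height - (y + height)   # lower edge, measured from the bottom
    return (left, bottom, right, top)


# A 200 x 100 area placed 72 units from the left and top of a 792-unit-tall page.
box = top_left_rect_to_bottom_left_box(72, 72, 200, 100, page_height=792)
print(box)  # (72, 620, 272, 720)
```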