Pdf To Text
Extracts text from PDF documents using Google Vision API's document text detection feature.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- Vision Client Id - The unique identifier of the Vision API connection, typically obtained from the Connect node.
- GCS Source URI - The Google Cloud Storage URI of the PDF file to process (e.g., gs://bucket-name/file.pdf).
- GCS Destination URI - The Google Cloud Storage URI where the output JSON files will be stored (e.g., gs://bucket-name/output/).
Options
No additional options available for this node.
Output
- Converted Path - The path to the output JSON file containing the extracted text.
How It Works
The Pdf To Text node uses Google Vision API to extract text from PDF documents and save the results as JSON files in Google Cloud Storage. When executed, the node:
- Retrieves the Vision API client using the provided client ID
- Validates that both source and destination URIs are not empty
- Calls the AsyncBatchAnnotateFiles method to initiate asynchronous text extraction
- Waits for the operation to complete
- Returns the path to the output JSON file containing the extracted text
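The request behind these steps can be sketched roughly as follows. This is a minimal illustration, assuming the node's call corresponds to the Vision REST method `files:asyncBatchAnnotate`; the helper name `build_async_request` and the `batch_size` parameter are hypothetical and not part of the node. The sketch only builds the request body and does not contact Google Cloud.

```python
import json


def build_async_request(source_uri: str, destination_uri: str, batch_size: int = 20) -> dict:
    """Build a JSON body for Vision files:asyncBatchAnnotate (illustrative helper)."""
    return {
        "requests": [
            {
                # Where the PDF is read from
                "inputConfig": {
                    "gcsSource": {"uri": source_uri},
                    "mimeType": "application/pdf",
                },
                # Document text detection is the feature this node relies on
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                # Where the output JSON files are written
                "outputConfig": {
                    "gcsDestination": {"uri": destination_uri},
                    # Assumed value: number of page responses per output file
                    "batchSize": batch_size,
                },
            }
        ]
    }


if __name__ == "__main__":
    body = build_async_request("gs://bucket-name/file.pdf", "gs://bucket-name/output/")
    print(json.dumps(body, indent=2))
```

The source and destination values map directly to the GCS Source URI and GCS Destination URI inputs; everything else in the body is fixed for PDF text extraction.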
Requirements
- A valid connection to Vision API established with the Connect node
- Valid Google Cloud credentials with appropriate permissions
- A PDF file stored in Google Cloud Storage
- The Vision API enabled in your Google Cloud project
- Proper permissions to read from the source GCS bucket and write to the destination GCS bucket
Error Handling
The node will return specific errors in the following cases:
- Empty or invalid Vision Client ID
- Empty GCS Source URI
- Empty GCS Destination URI
- Invalid GCS URIs
- Network connectivity issues
- Vision API service errors
- Authentication failures
- Insufficient permissions to access GCS buckets
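Several of these errors can be caught before the node runs by checking the URIs in an earlier step. The following is a minimal sketch, assuming a URI is acceptable when it uses the `gs` scheme and names a bucket; the function `validate_gcs_uri` is hypothetical, and the node's own validation rules may differ.

```python
from urllib.parse import urlparse


def validate_gcs_uri(uri: str, field: str) -> str:
    """Return the trimmed URI if it looks like gs://bucket/..., else raise ValueError."""
    if not uri or not uri.strip():
        raise ValueError(f"{field} cannot be empty")
    cleaned = uri.strip()
    parsed = urlparse(cleaned)
    # A GCS URI has the "gs" scheme and a bucket name in the host position
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"{field} must be a gs:// URI with a bucket name: {cleaned!r}")
    return cleaned


if __name__ == "__main__":
    print(validate_gcs_uri("gs://bucket-name/file.pdf", "GCS Source URI"))
    try:
        validate_gcs_uri("https://example.com/file.pdf", "GCS Source URI")
    except ValueError as err:
        print(f"Rejected: {err}")
```

A check like this covers only the empty and malformed URI cases. Authentication, permission, and service errors can only be detected when the request is actually made.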
Usage Notes
- The Vision Client ID must be obtained from a successful Connect node execution
- Both source and destination must be valid Google Cloud Storage URIs
- The source PDF file must be accessible from the specified GCS URI
- The destination GCS URI should end with a forward slash (/)
- This is an asynchronous operation that may take some time to complete depending on the PDF size
- The output is saved as JSON files in the specified destination GCS bucket
- The node supports multi-page PDF files
- The output JSON contains structured text data with positional information
- This node is useful for processing large documents and extracting searchable text
- The operation uses Google Cloud Storage for both input and output
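Once an output file has been downloaded from the destination bucket, the plain text can be pulled out of it. The sketch below assumes the file follows the Vision API's file annotation layout, with a top-level `responses` list whose entries carry a `fullTextAnnotation.text` field; the function `extract_text` and the sample data are hypothetical.

```python
import json


def extract_text(output_json: str) -> str:
    """Join the full text of every page response in one output file."""
    data = json.loads(output_json)
    pages = []
    for response in data.get("responses", []):
        # Pages without detected text may lack the annotation entirely
        annotation = response.get("fullTextAnnotation", {})
        pages.append(annotation.get("text", ""))
    return "\n".join(pages)


if __name__ == "__main__":
    # Made-up sample standing in for a downloaded output file
    sample = json.dumps({
        "responses": [
            {"fullTextAnnotation": {"text": "Page one"}, "context": {"pageNumber": 1}},
            {"fullTextAnnotation": {"text": "Page two"}, "context": {"pageNumber": 2}},
        ]
    })
    print(extract_text(sample))
```

Positional information, such as bounding boxes for blocks and words, sits alongside the text in the same annotation and can be read in a similar way when layout matters.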