
Pdf To Text

Extracts text from PDF documents using Google Vision API's document text detection feature.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
Info: If the Continue On Error property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • Vision Client Id - The unique identifier of the Vision API connection, typically obtained from the Connect node.
  • GCS Source URI - The Google Cloud Storage URI of the PDF file to process (e.g., gs://bucket-name/file.pdf).
  • GCS Destination URI - The Google Cloud Storage URI where the output JSON files will be stored (e.g., gs://bucket-name/output/).

Options

  • Credentials - Google Cloud service account credentials (optional - use instead of Connect node). If provided, the node will create its own client connection without requiring a Vision Client ID.

Output

  • Converted Path - The path to the output JSON file containing the extracted text.

How It Works

The Pdf To Text node uses the Google Vision API to extract text from PDF documents and save the results as JSON files in Google Cloud Storage. When executed, the node:

  1. Retrieves the Vision API client using the provided client ID
  2. Validates that both source and destination URIs are not empty
  3. Calls the AsyncBatchAnnotateFiles method to initiate asynchronous text extraction
  4. Waits for the operation to complete
  5. Returns the path to the output JSON file containing the extracted text

Requirements

  • A valid connection to Vision API established with the Connect node
  • Valid Google Cloud credentials with appropriate permissions
  • A PDF file stored in Google Cloud Storage
  • Enabled Vision API in your Google Cloud project
  • Proper permissions to read from the source GCS bucket and write to the destination GCS bucket

Error Handling

The node returns an error in the following cases:

  • Empty or invalid Vision Client ID
  • Empty GCS Source URI
  • Empty GCS Destination URI
  • Invalid GCS URIs
  • Network connectivity issues
  • Vision API service errors
  • Authentication failures
  • Insufficient permissions to access GCS buckets

Usage Notes

  • The Vision Client ID must be obtained from a successful Connect node execution (or provide credentials directly)
  • Both source and destination must be valid Google Cloud Storage URIs starting with gs://
  • The source PDF file must be accessible from the specified GCS URI
  • The destination GCS URI should end with a forward slash (/) to specify a directory
  • This is an asynchronous operation that may take some time to complete depending on the PDF size
  • The output is saved as JSON files in the specified destination GCS bucket
  • The node processes PDF files with support for multiple pages (batch size: 2 pages per JSON file)
  • The output JSON contains structured text data with positional information, bounding boxes, and confidence scores
  • Output file naming format: output-1-to-2.json, output-3-to-4.json, etc.
  • For local PDF files, upload to GCS first using the Google Storage package
  • The operation uses Google Cloud Storage for both input and output
  • Supports both application/pdf and image/tiff file formats
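Because the batch size is fixed at 2 pages per file, the set of output file names for a document can be predicted from its page count. A small sketch (the helper name is illustrative, not part of the package):

```javascript
// Predict the output JSON file names produced for a PDF with
// `pageCount` pages, given a batch size of 2 pages per file.
function expectedOutputFiles(pageCount, batchSize = 2) {
  const files = [];
  for (let start = 1; start <= pageCount; start += batchSize) {
    const end = Math.min(start + batchSize - 1, pageCount);
    files.push(`output-${start}-to-${end}.json`);
  }
  return files;
}
```

For example, a 5-page PDF yields output-1-to-2.json, output-3-to-4.json, and output-5-to-5.json.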

Example Use Cases

Extract Text from Invoice PDFs in GCS

Input:
Vision Client Id: (from Connect node)
GCS Source URI: gs://my-company-docs/invoices/invoice_2024_03.pdf
GCS Destination URI: gs://my-company-docs/extracted/

Output:
Converted Path: gs://my-company-docs/extracted/output-1-to-2.json

Next Steps:
1. Download JSON file using Google Storage nodes
2. Parse JSON to extract text and structured data
3. Import into database or spreadsheet

Batch Process Document Archive

1. Google Storage > List Files
Bucket: document-archive
Prefix: legal/contracts/
Output: pdf_files

2. Loop through pdf_files:
- PDF to Text
GCS Source URI: gs://document-archive/{{file.name}}
GCS Destination URI: gs://document-archive/extracted/{{file.name}}/
Output: json_path

- Google Storage > Download File
GCS URI: {{json_path}}
Local Path: /temp/{{file.name}}.json

- Programming > Evaluate
Parse JSON and extract relevant contract data

- Database > Insert
Store extracted contract information

Process Scanned Documents

1. Upload local PDF to GCS:
Google Storage > Upload File
Local Path: /scans/document_001.pdf
GCS URI: gs://scanning-bucket/uploads/document_001.pdf

2. PDF to Text
GCS Source URI: gs://scanning-bucket/uploads/document_001.pdf
GCS Destination URI: gs://scanning-bucket/processed/
Output: result_path

3. Download and process results:
Google Storage > Download File
GCS URI: {{result_path}}
Local Path: /processed/document_001.json

4. Programming > Evaluate
Parse JSON to extract text:
const fs = require('fs');
const data = JSON.parse(fs.readFileSync('/processed/document_001.json'));
const fullText = data.responses[0].fullTextAnnotation.text;
message.extracted_text = fullText;

Multi-Page PDF Processing

Input:
GCS Source URI: gs://documents/report_50_pages.pdf
GCS Destination URI: gs://documents/output/

Output:
Converted Path: gs://documents/output/output-1-to-2.json

Note: For a 50-page PDF, multiple JSON files will be created:
- output-1-to-2.json (pages 1-2)
- output-3-to-4.json (pages 3-4)
- output-5-to-6.json (pages 5-6)
... and so on

To process all pages:
1. List all output JSON files in gs://documents/output/
2. Download and parse each JSON file
3. Combine extracted text from all files
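Note that GCS listings are lexicographic, so output-11-to-12.json sorts before output-3-to-4.json; before combining, sort the files numerically by starting page. A sketch (function name is illustrative):

```javascript
// Sort Vision API output files numerically by their starting page,
// so extracted text is combined in document order.
function sortByStartPage(fileNames) {
  const startPage = (name) => {
    const m = name.match(/output-(\d+)-to-(\d+)\.json$/);
    return m ? parseInt(m[1], 10) : Number.MAX_SAFE_INTEGER;
  };
  return [...fileNames].sort((a, b) => startPage(a) - startPage(b));
}
```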
Analyze Contract Documents

1. PDF to Text
GCS Source URI: gs://legal-docs/contracts/{{contract_id}}.pdf
GCS Destination URI: gs://legal-docs/analysis/{{contract_id}}/

2. Google Storage > List Files
Bucket: legal-docs
Prefix: analysis/{{contract_id}}/
Output: json_files

3. Loop through json_files:
- Download and parse each JSON file
- Extract text from each page
- Search for specific clauses and terms
- Flag important sections for review

4. Generate summary report with extracted information

OCR for Searchable PDF Creation

1. Upload scanned PDF to GCS
2. PDF to Text to extract text and positions
3. Download JSON results
4. Parse JSON to get text with bounding box coordinates
5. Use coordinates to create searchable PDF overlay
6. Save searchable PDF back to GCS

Understanding the Output JSON

The output JSON file contains detailed information about the extracted text:

{
"responses": [
{
"fullTextAnnotation": {
"pages": [
{
"property": {
"detectedLanguages": [...]
},
"width": 1700,
"height": 2200,
"blocks": [...],
"confidence": 0.98
}
],
"text": "Full extracted text from the document..."
}
}
]
}

Key elements:

  • fullTextAnnotation.text: Complete text content
  • pages: Array of page objects with structure and layout
  • blocks: Text blocks with bounding boxes and positioning
  • confidence: OCR confidence score per page (0.0 to 1.0)
  • detectedLanguages: Languages detected in the document
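The per-page confidence score is useful for flagging pages that may need manual review. A sketch against the structure above (the 0.9 threshold is an arbitrary assumption):

```javascript
// Flag pages whose OCR confidence falls below a threshold.
// `result` has the shape shown above: responses[].fullTextAnnotation.pages[].
function lowConfidencePages(result, threshold = 0.9) {
  const flagged = [];
  (result.responses || []).forEach((response) => {
    const pages = (response.fullTextAnnotation || {}).pages || [];
    pages.forEach((page, i) => {
      if (typeof page.confidence === 'number' && page.confidence < threshold) {
        flagged.push({ page: i + 1, confidence: page.confidence });
      }
    });
  });
  return flagged;
}
```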

Tips

  • GCS Setup: Ensure your service account has read access to source bucket and write access to destination bucket
  • URI Format: Always use gs://bucket-name/path format for GCS URIs
  • Batch Size: Default batch size is 2 pages per JSON file - process all output files for complete document
  • Large PDFs: For very large PDFs, processing may take several minutes - use appropriate timeout settings
  • File Naming: Output files follow pattern output-X-to-Y.json where X and Y are page numbers
  • Destination Directory: Always end destination URI with / to create files in a directory
  • Local Files: Use Google Storage > Upload File to move local PDFs to GCS first
  • Parsing Results: Use JSON parsing to extract text from the structured output
  • Error Recovery: Implement retry logic for long-running operations
  • Cost Optimization: Vision API charges per page, so monitor usage for large document sets

Common Errors and Solutions

Error: "GCS Source URI cannot be empty"

Solution: Provide a valid Google Cloud Storage URI starting with gs://

Error: "GCS Destination URI cannot be empty"

Solution: Provide a valid destination URI ending with / for directory output

Error: "Invalid GCS URI format"

Solution: Ensure URIs start with gs:// and follow format: gs://bucket-name/path/to/file
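A quick check of the gs:// format can catch this before the node runs. A sketch (the validation function is illustrative; the bucket-name pattern follows GCS naming rules of 3-63 lowercase characters):

```javascript
// Validate that a string looks like a gs:// URI: gs://bucket-name/optional/path
function isValidGcsUri(uri) {
  return /^gs:\/\/[a-z0-9][a-z0-9._-]{1,61}[a-z0-9](\/.*)?$/.test(uri);
}
```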

Error: "Permission denied"

Solution:

  • Verify service account has storage.objects.get permission on source bucket
  • Verify service account has storage.objects.create permission on destination bucket
  • Check IAM roles in Google Cloud Console

Error: "File not found"

Solution:

  • Verify the source file exists in GCS
  • Check bucket name and file path are correct
  • Ensure file has been fully uploaded before processing

Error: "Operation timeout"

Solution:

  • Increase timeout setting for large PDF files
  • Check network connectivity to Google Cloud
  • Verify PDF file is not corrupted

Output file not found

Solution:

  • Wait for asynchronous operation to complete
  • Check destination bucket for output files
  • Verify destination URI has correct write permissions
  • Look for files matching pattern output-1-to-2.json

Working with Output Files

Download and Parse Results

// In Programming > Evaluate node
const fs = require('fs');

// Read the JSON output file
const jsonData = fs.readFileSync('/path/to/output-1-to-2.json', 'utf8');
const result = JSON.parse(jsonData);

// Extract full text
const fullText = result.responses[0].fullTextAnnotation.text;

// Extract page-by-page information
const pages = result.responses[0].fullTextAnnotation.pages;
pages.forEach((page, index) => {
console.log(`Page ${index + 1} confidence: ${page.confidence}`);
});

// Store extracted text
message.document_text = fullText;
message.page_count = pages.length;

Process Multiple Output Files

// Process all JSON files from multi-page PDF
const fs = require('fs');
const files = message.json_files; // from Google Storage > List Files
let combinedText = '';

files.forEach(file => {
const content = fs.readFileSync(file.localPath, 'utf8');
const data = JSON.parse(content);
combinedText += data.responses[0].fullTextAnnotation.text + '\n\n';
});

message.complete_document = combinedText;

Integration with Other Packages

With Google Storage

  1. Upload local PDF with Google Storage > Upload File
  2. Process with PDF to Text
  3. Download results with Google Storage > Download File
  4. Parse JSON to extract text

With Database

  1. Extract text from PDF
  2. Parse JSON results
  3. Store in database with metadata (filename, page count, confidence scores)
  4. Create searchable document index

With Google Gemini

  1. Extract text from PDF using PDF to Text
  2. Parse JSON to get document text
  3. Send to Google Gemini for analysis, summarization, or Q&A
  4. Process AI-generated insights