
Pdf To Text

Extracts text from PDF documents using Google Vision API's document text detection feature.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
Info: If the Continue On Error property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • Vision Client Id - The unique identifier of the Vision API connection, typically obtained from the Connect node.
  • GCS Source URI - The Google Cloud Storage URI of the PDF file to process (e.g., gs://bucket-name/file.pdf).
  • GCS Destination URI - The Google Cloud Storage URI where the output JSON files will be stored (e.g., gs://bucket-name/output/).

Options

  • Credentials - Google Cloud service account credentials (optional - use instead of Connect node). If provided, the node will create its own client connection without requiring a Vision Client ID.

Output

  • Converted Path - The path to the output JSON file containing the extracted text.

How It Works

The Pdf To Text node uses the Google Vision API to extract text from PDF documents and save the results as JSON files in Google Cloud Storage. When executed, the node:

  1. Retrieves the Vision API client using the provided client ID
  2. Validates that both source and destination URIs are not empty
  3. Calls the AsyncBatchAnnotateFiles method to initiate asynchronous text extraction
  4. Waits for the operation to complete
  5. Returns the path to the output JSON file containing the extracted text

Requirements

  • A valid connection to Vision API established with the Connect node
  • Valid Google Cloud credentials with appropriate permissions
  • A PDF file stored in Google Cloud Storage
  • Enabled Vision API in your Google Cloud project
  • Proper permissions to read from the source GCS bucket and write to the destination GCS bucket

Error Handling

The node returns an error in the following cases:

  • Empty or invalid Vision Client ID
  • Empty GCS Source URI
  • Empty GCS Destination URI
  • Invalid GCS URIs
  • Network connectivity issues
  • Vision API service errors
  • Authentication failures
  • Insufficient permissions to access GCS buckets

Usage Notes

  • The Vision Client ID must be obtained from a successful Connect node execution (or provide credentials directly)
  • Both source and destination must be valid Google Cloud Storage URIs starting with gs://
  • The source PDF file must be accessible from the specified GCS URI
  • The destination GCS URI should end with a forward slash (/) to specify a directory
  • This is an asynchronous operation that may take some time to complete depending on the PDF size
  • The output is saved as JSON files in the specified destination GCS bucket
  • The node processes PDF files with support for multiple pages (batch size: 2 pages per JSON file)
  • The output JSON contains structured text data with positional information, bounding boxes, and confidence scores
  • Output file naming format: output-1-to-2.json, output-3-to-4.json, etc.
  • For local PDF files, upload to GCS first using the Google Storage package
  • The operation uses Google Cloud Storage for both input and output
  • Supports both application/pdf and image/tiff file formats
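Because the batch size is fixed at 2 pages per file, the set of output file names for a document can be predicted from its page count. A small sketch (the helper name is illustrative, not part of the package):

```javascript
// Predict the output JSON file names produced for a PDF with
// `pageCount` pages, given a batch size of 2 pages per file.
function expectedOutputFiles(pageCount, batchSize = 2) {
  const files = [];
  for (let start = 1; start <= pageCount; start += batchSize) {
    const end = Math.min(start + batchSize - 1, pageCount);
    files.push(`output-${start}-to-${end}.json`);
  }
  return files;
}
```

For example, a 5-page PDF yields output-1-to-2.json, output-3-to-4.json, and output-5-to-5.json.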

Example Use Cases

Extract Text from Invoice PDFs in GCS

Input:
Vision Client Id: (from Connect node)
GCS Source URI: gs://my-company-docs/invoices/invoice_2024_03.pdf
GCS Destination URI: gs://my-company-docs/extracted/

Output:
Converted Path: gs://my-company-docs/extracted/output-1-to-2.json

Next Steps:
1. Download JSON file using Google Storage nodes
2. Parse JSON to extract text and structured data
3. Import into database or spreadsheet

Batch Process Document Archive

1. Google Storage > List Files
Bucket: document-archive
Prefix: legal/contracts/
Output: pdf_files

2. Loop through pdf_files:
- PDF to Text
GCS Source URI: gs://document-archive/{{file.name}}
GCS Destination URI: gs://document-archive/extracted/{{file.name}}/
Output: json_path

- Google Storage > Download File
GCS URI: {{json_path}}
Local Path: /temp/{{file.name}}.json

- Programming > Evaluate
Parse JSON and extract relevant contract data

- Database > Insert
Store extracted contract information

Process Scanned Documents

1. Upload local PDF to GCS:
Google Storage > Upload File
Local Path: /scans/document_001.pdf
GCS URI: gs://scanning-bucket/uploads/document_001.pdf

2. PDF to Text
GCS Source URI: gs://scanning-bucket/uploads/document_001.pdf
GCS Destination URI: gs://scanning-bucket/processed/
Output: result_path

3. Download and process results:
Google Storage > Download File
GCS URI: {{result_path}}
Local Path: /processed/document_001.json

4. Programming > Evaluate
Parse JSON to extract text:
const fs = require('fs');
const data = JSON.parse(fs.readFileSync('/processed/document_001.json'));
const fullText = data.responses[0].fullTextAnnotation.text;
message.extracted_text = fullText;

Multi-Page PDF Processing

Input:
GCS Source URI: gs://documents/report_50_pages.pdf
GCS Destination URI: gs://documents/output/

Output:
Converted Path: gs://documents/output/output-1-to-2.json

Note: For a 50-page PDF, multiple JSON files will be created:
- output-1-to-2.json (pages 1-2)
- output-3-to-4.json (pages 3-4)
- output-5-to-6.json (pages 5-6)
... and so on

To process all pages:
1. List all output JSON files in gs://documents/output/
2. Download and parse each JSON file
3. Combine extracted text from all files
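Note that GCS listings are lexicographic, so output-11-to-12.json sorts before output-3-to-4.json; before combining, sort the files numerically by starting page. A sketch (function name is illustrative):

```javascript
// Sort Vision API output files numerically by their starting page,
// so extracted text is combined in document order.
function sortByStartPage(fileNames) {
  const startPage = (name) => {
    const m = name.match(/output-(\d+)-to-(\d+)\.json$/);
    return m ? parseInt(m[1], 10) : Number.MAX_SAFE_INTEGER;
  };
  return [...fileNames].sort((a, b) => startPage(a) - startPage(b));
}
```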
Analyze Contract Documents

1. PDF to Text
GCS Source URI: gs://legal-docs/contracts/{{contract_id}}.pdf
GCS Destination URI: gs://legal-docs/analysis/{{contract_id}}/

2. Google Storage > List Files
Bucket: legal-docs
Prefix: analysis/{{contract_id}}/
Output: json_files

3. Loop through json_files:
- Download and parse each JSON file
- Extract text from each page
- Search for specific clauses and terms
- Flag important sections for review

4. Generate summary report with extracted information

OCR for Searchable PDF Creation

1. Upload scanned PDF to GCS
2. PDF to Text to extract text and positions
3. Download JSON results
4. Parse JSON to get text with bounding box coordinates
5. Use coordinates to create searchable PDF overlay
6. Save searchable PDF back to GCS

Understanding the Output JSON

The output JSON file contains detailed information about the extracted text:

{
"responses": [
{
"fullTextAnnotation": {
"pages": [
{
"property": {
"detectedLanguages": [...]
},
"width": 1700,
"height": 2200,
"blocks": [...],
"confidence": 0.98
}
],
"text": "Full extracted text from the document..."
}
}
]
}

Key elements:

  • fullTextAnnotation.text: Complete text content
  • pages: Array of page objects with structure and layout
  • blocks: Text blocks with bounding boxes and positioning
  • confidence: OCR confidence score per page (0.0 to 1.0)
  • detectedLanguages: Languages detected in the document
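The per-page confidence score is useful for flagging pages that may need manual review. A sketch against the structure above (the 0.9 threshold is an arbitrary assumption):

```javascript
// Flag pages whose OCR confidence falls below a threshold.
// `result` has the shape shown above: responses[].fullTextAnnotation.pages[].
function lowConfidencePages(result, threshold = 0.9) {
  const flagged = [];
  (result.responses || []).forEach((response) => {
    const pages = (response.fullTextAnnotation || {}).pages || [];
    pages.forEach((page, i) => {
      if (typeof page.confidence === 'number' && page.confidence < threshold) {
        flagged.push({ page: i + 1, confidence: page.confidence });
      }
    });
  });
  return flagged;
}
```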

Tips

  • GCS Setup: Ensure your service account has read access to source bucket and write access to destination bucket
  • URI Format: Always use gs://bucket-name/path format for GCS URIs
  • Batch Size: Default batch size is 2 pages per JSON file - process all output files for complete document
  • Large PDFs: For very large PDFs, processing may take several minutes - use appropriate timeout settings
  • File Naming: Output files follow pattern output-X-to-Y.json where X and Y are page numbers
  • Destination Directory: Always end destination URI with / to create files in a directory
  • Local Files: Use Google Storage > Upload File to move local PDFs to GCS first
  • Parsing Results: Use JSON parsing to extract text from the structured output
  • Error Recovery: Implement retry logic for long-running operations
  • Cost Optimization: Vision API charges per page, so monitor usage for large document sets

Common Errors and Solutions

Error: "GCS Source URI cannot be empty"

Solution: Provide a valid Google Cloud Storage URI starting with gs://

Error: "GCS Destination URI cannot be empty"

Solution: Provide a valid destination URI ending with / for directory output

Error: "Invalid GCS URI format"

Solution: Ensure URIs start with gs:// and follow format: gs://bucket-name/path/to/file
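A quick check of the gs:// format can catch this before the node runs. A sketch (the validation function is illustrative; the bucket-name pattern follows GCS naming rules of 3-63 lowercase characters):

```javascript
// Validate that a string looks like a gs:// URI: gs://bucket-name/optional/path
function isValidGcsUri(uri) {
  return /^gs:\/\/[a-z0-9][a-z0-9._-]{1,61}[a-z0-9](\/.*)?$/.test(uri);
}
```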

Error: "Permission denied"

Solution:

  • Verify service account has storage.objects.get permission on source bucket
  • Verify service account has storage.objects.create permission on destination bucket
  • Check IAM roles in Google Cloud Console

Error: "File not found"

Solution:

  • Verify the source file exists in GCS
  • Check bucket name and file path are correct
  • Ensure file has been fully uploaded before processing

Error: "Operation timeout"

Solution:

  • Increase timeout setting for large PDF files
  • Check network connectivity to Google Cloud
  • Verify PDF file is not corrupted

Output file not found

Solution:

  • Wait for asynchronous operation to complete
  • Check destination bucket for output files
  • Verify destination URI has correct write permissions
  • Look for files matching pattern output-1-to-2.json

Working with Output Files

Download and Parse Results

// In Programming > Evaluate node
const fs = require('fs');

// Read the JSON output file
const jsonData = fs.readFileSync('/path/to/output-1-to-2.json', 'utf8');
const result = JSON.parse(jsonData);

// Extract full text
const fullText = result.responses[0].fullTextAnnotation.text;

// Extract page-by-page information
const pages = result.responses[0].fullTextAnnotation.pages;
pages.forEach((page, index) => {
console.log(`Page ${index + 1} confidence: ${page.confidence}`);
});

// Store extracted text
message.document_text = fullText;
message.page_count = pages.length;

Process Multiple Output Files

// Process all JSON files from multi-page PDF
const fs = require('fs');
const files = message.json_files; // from Google Storage > List Files
let combinedText = '';

files.forEach(file => {
const content = fs.readFileSync(file.localPath, 'utf8');
const data = JSON.parse(content);
combinedText += data.responses[0].fullTextAnnotation.text + '\n\n';
});

message.complete_document = combinedText;

Integration with Other Packages

With Google Storage

  1. Upload local PDF with Google Storage > Upload File
  2. Process with PDF to Text
  3. Download results with Google Storage > Download File
  4. Parse JSON to extract text

With Database

  1. Extract text from PDF
  2. Parse JSON results
  3. Store in database with metadata (filename, page count, confidence scores)
  4. Create searchable document index

With Google Gemini

  1. Extract text from PDF using PDF to Text
  2. Parse JSON to get document text
  3. Send to Google Gemini for analysis, summarization, or Q&A
  4. Process AI-generated insights