Speech to Text

Converts speech audio files stored in Google Cloud Storage to text using Google Cloud Speech-to-Text API with advanced recognition features and multiple language support.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
info

If the Continue On Error property is true, errors are not caught during project execution, even if a Catch node is used.

Inputs

  • Google Cloud Storage (GCS) URI - The Google Cloud Storage URI of the audio file to transcribe. Must be in the format gs://bucket_name/path/to/file.wav. The audio file must be accessible by the service account.
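
Before handing a URI to the node, it can help to check that it matches the required shape. A minimal sketch, assuming the gs://bucket_name/path/to/file format described above (the helper name and bucket-name pattern are illustrative, not part of the node):

```javascript
// Sketch: validate that a string looks like a usable GCS URI before
// passing it to the node. The bucket-name pattern is an assumption
// based on common GCS naming rules.
function isValidGcsUri(uri) {
  if (typeof uri !== "string" || uri.trim() === "") return false;
  // gs://, then a bucket name, then a non-empty object path.
  return /^gs:\/\/[a-z0-9][a-z0-9._-]*\/.+$/.test(uri.trim());
}

console.log(isValidGcsUri("gs://call-recordings/customer-call.wav")); // true
console.log(isValidGcsUri("s3://wrong-scheme/file.wav"));             // false
console.log(isValidGcsUri("gs://bucket-only"));                       // false
```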

Output

  • Result - An array of speech recognition results containing:
    • Transcribed text for each segment
    • Confidence scores (0.0 to 1.0)
    • Alternative transcriptions (if available)
    • Word-level timing information (if enabled)
    • Channel tags (for multi-channel audio)

Options

Audio Configuration

  • Sample Rate - The sample rate (in Hertz) of the audio data. Must match the actual sample rate of the audio file. Default is 16000 Hz. Common values:
    • 8000 - Telephone quality
    • 16000 - Wideband quality (recommended)
    • 32000 - High quality
    • 44100 - CD quality
    • 48000 - Professional quality

Language and Model

  • Language Code - The BCP-47 language code for the audio language. Default is en-US. Examples:

    • en-US - English (United States)
    • en-GB - English (United Kingdom)
    • es-ES - Spanish (Spain)
    • fr-FR - French (France)
    • de-DE - German (Germany)
    • ja-JP - Japanese
    • zh-CN - Chinese (Simplified)
    • tr-TR - Turkish
  • Model - The speech recognition model optimized for different audio sources:

    • Default - Best for most scenarios with clear audio
    • Video - Optimized for audio from video recordings
    • Phone Call - Optimized for audio from phone calls (narrowband)
    • Command And Search - Best for short queries, voice commands, or voice search

Recognition Features

  • Automatic Punctuation - When enabled, automatically adds punctuation (periods, commas, question marks) to the transcriptions. Improves readability but may not be available for all languages. Default is false.

  • Enable Word Time Offsets - When enabled, provides word-level start and end timestamps in the results. Useful for:

    • Creating subtitles and captions
    • Aligning text with audio playback
    • Building interactive transcripts
    • Detecting specific words in the timeline

    Default is false.
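
For the subtitle use case, the word-level timestamps can be grouped into caption cues. A sketch assuming the { word, startTime, endTime } shape shown in the Output Structure section below, with times like "1.5s" (the helper names and the seven-words-per-cue grouping are our own choices):

```javascript
// Convert a "1.5s"-style offset into a number of seconds.
function toSeconds(t) {
  return parseFloat(t.replace(/s$/, ""));
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTimestamp(seconds) {
  const pad = (n, w) => String(n).padStart(w, "0");
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = Math.floor(seconds % 60);
  const ms = Math.round((seconds % 1) * 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Group words into SRT cues of at most `wordsPerCue` words each.
function buildSrt(words, wordsPerCue = 7) {
  const cues = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const chunk = words.slice(i, i + wordsPerCue);
    const start = srtTimestamp(toSeconds(chunk[0].startTime));
    const end = srtTimestamp(toSeconds(chunk[chunk.length - 1].endTime));
    const text = chunk.map((w) => w.word).join(" ");
    cues.push(`${cues.length + 1}\n${start} --> ${end}\n${text}`);
  }
  return cues.join("\n\n");
}
```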

Authentication

  • Credentials - Google Cloud service account credentials in JSON format. The service account must have:
    • Cloud Speech-to-Text API enabled
    • Storage Object Viewer role (to access GCS files)
    • Appropriate permissions for the bucket containing the audio file

How It Works

The Speech to Text node uses Google Cloud's long-running recognition method to convert audio to text. When executed:

  1. Validates the GCS URI format and ensures it's not empty
  2. Authenticates with Google Cloud Speech-to-Text API using provided credentials
  3. Configures recognition parameters:
    • Sets audio encoding to LINEAR16 (WAV format)
    • Applies specified sample rate and language code
    • Selects the appropriate recognition model
    • Enables requested features (punctuation, word offsets)
  4. Submits long-running recognition request to Google Cloud
  5. Waits for recognition operation to complete
  6. Returns detailed transcription results with metadata
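
The configuration assembled in step 3 corresponds roughly to the request body that the Speech-to-Text v1 long-running recognition method expects. A sketch of building it (the helper name, option names, and defaults are ours, not the node's actual code):

```javascript
// Sketch of the request body from step 3, in the shape expected by the
// Speech-to-Text v1 longRunningRecognize method. Defaults mirror the
// node's documented defaults; option names are illustrative.
function buildRecognizeRequest(gcsUri, options = {}) {
  return {
    audio: { uri: gcsUri },
    config: {
      encoding: "LINEAR16",                           // WAV, as the node uses
      sampleRateHertz: options.sampleRate || 16000,
      languageCode: options.languageCode || "en-US",
      model: options.model || "default",
      enableAutomaticPunctuation: !!options.punctuation,
      enableWordTimeOffsets: !!options.wordOffsets,
    },
  };
}

const request = buildRecognizeRequest("gs://bucket/audio.wav", {
  sampleRate: 8000,
  model: "phone_call",
  punctuation: true,
});
```

With the official @google-cloud/speech Node.js client, a request like this would be submitted via client.longRunningRecognize(request), then awaited with operation.promise() (steps 4 and 5).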

Requirements

  • Valid Google Cloud Storage URI (gs://bucket/file.wav)
  • Audio file in supported format (WAV/LINEAR16 recommended)
  • Google Cloud service account with Speech-to-Text API enabled
  • Service account with Storage Object Viewer permissions
  • Internet connectivity to Google Cloud services

Error Handling

The node returns specific errors for common issues:

  • ErrInvalidArg:

    • Empty or missing GCS URI
    • Invalid sample rate value
    • Invalid language code
    • Recognition operation failed
  • ErrCredentials:

    • Invalid or missing credentials
    • Malformed JSON credentials
    • Missing credential content
    • Insufficient API permissions

Common Google Cloud API errors:

  • Invalid GCS URI: Check the gs:// path format and bucket name
  • File not found: Verify file exists and service account has access
  • Permission denied: Ensure service account has Storage Object Viewer role
  • Invalid audio format: Convert audio to supported format
  • Sample rate mismatch: Set sample rate to match actual audio
  • Language not supported: Use a supported language code
  • API not enabled: Enable Speech-to-Text API in Cloud Console

Example Use Cases

Transcribe Customer Service Calls

Input:
GCS URI: gs://call-recordings/customer-call-2024-01-15.wav
Sample Rate: 8000
Language Code: en-US
Model: Phone Call
Automatic Punctuation: true

Use Case: Convert recorded phone conversations to searchable text for quality
assurance and customer insights.

Generate Video Subtitles

Input:
GCS URI: gs://video-audio/training-video-audio.wav
Sample Rate: 48000
Language Code: en-US
Model: Video
Enable Word Time Offsets: true
Automatic Punctuation: true

Use Case: Create accurate subtitles with precise timing for training videos
and educational content.

Transcribe Meeting Recordings

Input:
GCS URI: gs://meetings/board-meeting-2024-q1.wav
Sample Rate: 16000
Language Code: en-US
Model: Default
Automatic Punctuation: true

Use Case: Convert meeting recordings to text for minutes, action items,
and documentation.

Voice Command Processing

Input:
GCS URI: gs://voice-commands/user-query-12345.wav
Sample Rate: 16000
Language Code: en-US
Model: Command And Search

Use Case: Quickly transcribe short voice commands or search queries from
voice-enabled applications.

Multilingual Content Transcription

Input:
GCS URI: gs://podcasts/spanish-episode-15.wav
Sample Rate: 44100
Language Code: es-ES
Model: Default
Automatic Punctuation: true

Use Case: Create transcripts for podcasts and audio content in multiple languages.

Usage Notes

Audio File Requirements

  • Audio must be stored in Google Cloud Storage (not local files)
  • Supported formats: WAV (LINEAR16), FLAC, MP3, OGG Opus, AMR, SPEEX
  • The node uses LINEAR16 encoding, so WAV format is recommended
  • Maximum audio duration: 480 minutes (8 hours)
  • For audio over 1 minute, long-running recognition is used automatically

Sample Rate Guidelines

  • The sample rate MUST match your audio file's actual sample rate
  • Mismatched sample rates result in poor transcription quality
  • Use ffprobe or similar tools to check actual sample rate
  • Common mistake: setting 16000 Hz for 8000 Hz phone call audio

Language Code Selection

  • Always specify the correct language for best accuracy
  • Language codes are case-sensitive (use en-US, not en-us)
  • Some languages support multiple variants (en-US, en-GB, en-AU)
  • Automatic language detection is not available - you must specify the language explicitly

Model Selection Best Practices

  • Default: Use for general audio with clear speech
  • Video: Use for audio extracted from video files
  • Phone Call: Use for telephony audio (8kHz, narrowband)
  • Command And Search: Use for short utterances under 30 seconds

Automatic Punctuation

  • Improves readability of transcripts significantly
  • Not available for all languages (check Google Cloud docs)
  • May slightly increase processing time
  • Recommended for most use cases

Word Time Offsets

  • Adds start and end time for each word in the transcript
  • Useful for subtitle generation and audio alignment
  • Increases response size and processing time slightly
  • Results include timing in seconds from audio start

Cost Optimization

  • Transcription is billed per 15 seconds of audio
  • Using appropriate models can improve accuracy and reduce costs
  • Consider batching multiple short files together
  • Cache transcription results to avoid re-processing
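
Since billing is per 15-second increment, a rough cost estimate is straightforward. A sketch assuming increments are rounded up (check current Google Cloud pricing; the helper names are ours and any price you plug in is a placeholder):

```javascript
// Number of 15-second billing units for a given audio duration,
// assuming partial increments are rounded up.
function billable15sUnits(durationSeconds) {
  return Math.ceil(durationSeconds / 15);
}

// Rough cost estimate; pricePerUnit must come from current GCP pricing.
function estimateCost(durationSeconds, pricePerUnit) {
  return billable15sUnits(durationSeconds) * pricePerUnit;
}

// Example: a 61-second file bills as five 15-second units.
```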

Output Structure

The Result output contains an array of recognition results:

[
  {
    "alternatives": [
      {
        "transcript": "hello world this is a test",
        "confidence": 0.95,
        "words": [            // Only if word time offsets enabled
          {
            "word": "hello",
            "startTime": "0s",
            "endTime": "0.5s"
          }
          // ... more words
        ]
      }
      // ... alternative transcriptions
    ]
  }
  // ... more results for different segments
]

Access the transcription in your flow:

// Get the first (most confident) transcription
const transcript = $.result[0].alternatives[0].transcript;

// Get confidence score
const confidence = $.result[0].alternatives[0].confidence;

// Access word timings (if enabled)
const words = $.result[0].alternatives[0].words;
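
For longer recordings the result array contains one entry per segment, so the snippets above only cover the first segment. A sketch that stitches the top alternative of every segment into one transcript (the helper name is ours):

```javascript
// Concatenate the top alternative of every result segment into a single
// transcript string, skipping segments with no alternatives.
function fullTranscript(results) {
  return results
    .filter((r) => r.alternatives && r.alternatives.length > 0)
    .map((r) => r.alternatives[0].transcript)
    .join(" ")
    .trim();
}

// In a flow: const text = fullTranscript($.result);
```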

Tips for Best Results

  1. Audio Quality: Higher quality audio produces better transcriptions

    • Reduce background noise
    • Use proper microphone placement
    • Avoid audio compression artifacts
  2. Sample Rate Matching: Always match the node's sample rate to your audio file

    • Check actual audio properties before transcription
    • Don't assume sample rate - verify it
  3. Language Selection: Specify the correct language for best accuracy

    • Use language-specific models when available
    • Consider regional variants (en-US vs. en-GB)
  4. Model Selection: Choose the right model for your audio source

    • Phone call audio should use "Phone Call" model
    • Video audio should use "Video" model
  5. Enable Punctuation: Unless you have specific reasons not to

    • Makes transcripts much more readable
    • Helps with downstream text processing
  6. Use Word Offsets: For subtitle/caption generation

    • Essential for video subtitling
    • Helpful for searchable audio players
  7. Error Handling: Implement retry logic

    • Network issues can cause transient failures
    • Use exponential backoff for retries
  8. Batch Processing: For multiple files

    • Reuse credentials across multiple nodes
    • Process files in parallel when possible
  9. Monitor Costs: Track usage in Google Cloud Console

    • Set up billing alerts
    • Use appropriate models to optimize costs
  10. Test First: Always test with sample audio

    • Verify settings produce good results
    • Check confidence scores are acceptable
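
The retry advice in tip 7 can be sketched as a small wrapper. The operation argument stands in for whatever submits the recognition request; the attempt count and base delay are illustrative defaults:

```javascript
// Retry an async operation with exponential backoff: delays double on
// each failed attempt, and the last error is rethrown once attempts
// are exhausted.
async function withRetry(operation, maxAttempts = 3, baseDelayMs = 1000) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```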

Common Errors and Solutions

"Google Cloud Storage (GCS) URI cannot be empty"

  • Cause: No GCS URI provided in input
  • Solution: Provide a valid gs:// URI pointing to your audio file

"No Credential Content"

  • Cause: Credentials not properly configured
  • Solution: Verify service account JSON is complete and valid

"Failed to recognize audio"

  • Cause: Various API or audio format issues
  • Solution:
    • Check audio file format is supported
    • Verify sample rate matches audio file
    • Ensure service account has file access
    • Check API is enabled in Cloud Console

Low Confidence Scores

  • Cause: Poor audio quality or incorrect settings
  • Solution:
    • Improve audio quality (reduce noise)
    • Verify correct language code
    • Use appropriate recognition model
    • Check sample rate matches audio

"Permission Denied" accessing GCS

  • Cause: Service account lacks storage permissions
  • Solution: Grant "Storage Object Viewer" role to service account

Empty or Missing Transcription

  • Cause: No speech detected in audio
  • Solution:
    • Verify audio file contains speech
    • Check audio isn't corrupted
    • Ensure correct sample rate