Speech to Text

Converts speech audio files stored in Google Cloud Storage to text using Google Cloud Speech-to-Text API with advanced recognition features and multiple language support.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
info

If the Continue On Error property is true, errors are not caught during project execution, even if a Catch node is used.

Inputs

  • Google Cloud Storage (GCS) URI - The Google Cloud Storage URI of the audio file to transcribe. Must be in the format gs://bucket_name/path/to/file.wav. The audio file must be accessible by the service account.
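
Before handing a URI to the node, it can help to check that it matches the required shape. A minimal sketch, assuming the gs://bucket_name/path/to/file format described above (the helper name and bucket-name pattern are illustrative, not part of the node):

```javascript
// Sketch: validate that a string looks like a usable GCS URI before
// passing it to the node. The bucket-name pattern is an assumption
// based on common GCS naming rules.
function isValidGcsUri(uri) {
  if (typeof uri !== "string" || uri.trim() === "") return false;
  // gs://, then a bucket name, then a non-empty object path.
  return /^gs:\/\/[a-z0-9][a-z0-9._-]*\/.+$/.test(uri.trim());
}

console.log(isValidGcsUri("gs://call-recordings/customer-call.wav")); // true
console.log(isValidGcsUri("s3://wrong-scheme/file.wav"));             // false
console.log(isValidGcsUri("gs://bucket-only"));                       // false
```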

Output

  • Result - An array of speech recognition results containing:
    • Transcribed text for each segment
    • Confidence scores (0.0 to 1.0)
    • Alternative transcriptions (if available)
    • Word-level timing information (if enabled)
    • Channel tags (for multi-channel audio)

Options

Audio Configuration

  • Sample Rate - The sample rate (in Hertz) of the audio data. Must match the actual sample rate of the audio file. Default is 16000 Hz. Common values:
    • 8000 - Telephone quality
    • 16000 - Wideband quality (recommended)
    • 32000 - High quality
    • 44100 - CD quality
    • 48000 - Professional quality

Language and Model

  • Language Code - The BCP-47 language code for the audio language. Default is en-US. Examples:

    • en-US - English (United States)
    • en-GB - English (United Kingdom)
    • es-ES - Spanish (Spain)
    • fr-FR - French (France)
    • de-DE - German (Germany)
    • ja-JP - Japanese
    • zh-CN - Chinese (Simplified)
    • tr-TR - Turkish
  • Model - The speech recognition model optimized for different audio sources:

    • Default - Best for most scenarios with clear audio
    • Video - Optimized for audio from video recordings
    • Phone Call - Optimized for audio from phone calls (narrowband)
    • Command And Search - Best for short queries, voice commands, or voice search

Recognition Features

  • Automatic Punctuation - When enabled, automatically adds punctuation (periods, commas, question marks) to the transcriptions. Improves readability but may not be available for all languages. Default is false.

  • Enable Word Time Offsets - When enabled, provides word-level start and end timestamps in the results. Useful for:

    • Creating subtitles and captions
    • Aligning text with audio playback
    • Building interactive transcripts
    • Detecting specific words in the timeline

    Default is false.
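
For the subtitle use case, the word-level timestamps can be grouped into caption cues. A sketch assuming the { word, startTime, endTime } shape shown in the Output Structure section below, with times like "1.5s" (the helper names and the seven-words-per-cue grouping are our own choices):

```javascript
// Convert a "1.5s"-style offset into a number of seconds.
function toSeconds(t) {
  return parseFloat(t.replace(/s$/, ""));
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTimestamp(seconds) {
  const pad = (n, w) => String(n).padStart(w, "0");
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = Math.floor(seconds % 60);
  const ms = Math.round((seconds % 1) * 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Group words into SRT cues of at most `wordsPerCue` words each.
function buildSrt(words, wordsPerCue = 7) {
  const cues = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const chunk = words.slice(i, i + wordsPerCue);
    const start = srtTimestamp(toSeconds(chunk[0].startTime));
    const end = srtTimestamp(toSeconds(chunk[chunk.length - 1].endTime));
    const text = chunk.map((w) => w.word).join(" ");
    cues.push(`${cues.length + 1}\n${start} --> ${end}\n${text}`);
  }
  return cues.join("\n\n");
}
```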

Authentication

  • Credentials - Google Cloud service account credentials in JSON format. The service account must have:
    • Cloud Speech-to-Text API enabled
    • Storage Object Viewer role (to access GCS files)
    • Appropriate permissions for the bucket containing the audio file

How It Works

The Speech to Text node uses Google Cloud's long-running recognition method to convert audio to text. When executed:

  1. Validates the GCS URI format and ensures it's not empty
  2. Authenticates with Google Cloud Speech-to-Text API using provided credentials
  3. Configures recognition parameters:
    • Sets audio encoding to LINEAR16 (WAV format)
    • Applies specified sample rate and language code
    • Selects the appropriate recognition model
    • Enables requested features (punctuation, word offsets)
  4. Submits long-running recognition request to Google Cloud
  5. Waits for recognition operation to complete
  6. Returns detailed transcription results with metadata
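
The configuration assembled in step 3 corresponds roughly to the request body that the Speech-to-Text v1 long-running recognition method expects. A sketch of building it (the helper name, option names, and defaults are ours, not the node's actual code):

```javascript
// Sketch of the request body from step 3, in the shape expected by the
// Speech-to-Text v1 longRunningRecognize method. Defaults mirror the
// node's documented defaults; option names are illustrative.
function buildRecognizeRequest(gcsUri, options = {}) {
  return {
    audio: { uri: gcsUri },
    config: {
      encoding: "LINEAR16",                           // WAV, as the node uses
      sampleRateHertz: options.sampleRate || 16000,
      languageCode: options.languageCode || "en-US",
      model: options.model || "default",
      enableAutomaticPunctuation: !!options.punctuation,
      enableWordTimeOffsets: !!options.wordOffsets,
    },
  };
}

const request = buildRecognizeRequest("gs://bucket/audio.wav", {
  sampleRate: 8000,
  model: "phone_call",
  punctuation: true,
});
```

With the official @google-cloud/speech Node.js client, a request like this would be submitted via client.longRunningRecognize(request), then awaited with operation.promise() (steps 4 and 5).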

Requirements

  • Valid Google Cloud Storage URI (gs://bucket/file.wav)
  • Audio file in supported format (WAV/LINEAR16 recommended)
  • Google Cloud service account with Speech-to-Text API enabled
  • Service account with Storage Object Viewer permissions
  • Internet connectivity to Google Cloud services

Error Handling

The node returns specific errors for common issues:

  • ErrInvalidArg:

    • Empty or missing GCS URI
    • Invalid sample rate value
    • Invalid language code
    • Recognition operation failed
  • ErrCredentials:

    • Invalid or missing credentials
    • Malformed JSON credentials
    • Missing credential content
    • Insufficient API permissions

Common Google Cloud API errors:

  • Invalid GCS URI: Check the gs:// path format and bucket name
  • File not found: Verify file exists and service account has access
  • Permission denied: Ensure service account has Storage Object Viewer role
  • Invalid audio format: Convert audio to supported format
  • Sample rate mismatch: Set sample rate to match actual audio
  • Language not supported: Use a supported language code
  • API not enabled: Enable Speech-to-Text API in Cloud Console

Example Use Cases

Transcribe Customer Service Calls

Input:
GCS URI: gs://call-recordings/customer-call-2024-01-15.wav
Sample Rate: 8000
Language Code: en-US
Model: Phone Call
Automatic Punctuation: true

Use Case: Convert recorded phone conversations to searchable text for quality
assurance and customer insights.

Generate Video Subtitles

Input:
GCS URI: gs://video-audio/training-video-audio.wav
Sample Rate: 48000
Language Code: en-US
Model: Video
Enable Word Time Offsets: true
Automatic Punctuation: true

Use Case: Create accurate subtitles with precise timing for training videos
and educational content.

Transcribe Meeting Recordings

Input:
GCS URI: gs://meetings/board-meeting-2024-q1.wav
Sample Rate: 16000
Language Code: en-US
Model: Default
Automatic Punctuation: true

Use Case: Convert meeting recordings to text for minutes, action items,
and documentation.

Voice Command Processing

Input:
GCS URI: gs://voice-commands/user-query-12345.wav
Sample Rate: 16000
Language Code: en-US
Model: Command And Search

Use Case: Quickly transcribe short voice commands or search queries from
voice-enabled applications.

Multilingual Content Transcription

Input:
GCS URI: gs://podcasts/spanish-episode-15.wav
Sample Rate: 44100
Language Code: es-ES
Model: Default
Automatic Punctuation: true

Use Case: Create transcripts for podcasts and audio content in multiple languages.

Usage Notes

Audio File Requirements

  • Audio must be stored in Google Cloud Storage (not local files)
  • Supported formats: WAV (LINEAR16), FLAC, MP3, OGG Opus, AMR, SPEEX
  • The node uses LINEAR16 encoding, so WAV format is recommended
  • Maximum audio duration: 480 minutes (8 hours)
  • For audio over 1 minute, long-running recognition is used automatically

Sample Rate Guidelines

  • The sample rate MUST match your audio file's actual sample rate
  • Mismatched sample rates result in poor transcription quality
  • Use ffprobe or similar tools to check actual sample rate
  • Common mistake: setting 16000 Hz for 8000 Hz phone call audio

Language Code Selection

  • Always specify the correct language for best accuracy
  • Language codes are case-sensitive (use en-US, not en-us)
  • Some languages support multiple variants (en-US, en-GB, en-AU)
  • Automatic language detection is not available - you must specify the language explicitly

Model Selection Best Practices

  • Default: Use for general audio with clear speech
  • Video: Use for audio extracted from video files
  • Phone Call: Use for telephony audio (8kHz, narrowband)
  • Command And Search: Use for short utterances under 30 seconds

Automatic Punctuation

  • Improves readability of transcripts significantly
  • Not available for all languages (check Google Cloud docs)
  • May slightly increase processing time
  • Recommended for most use cases

Word Time Offsets

  • Adds start and end time for each word in the transcript
  • Useful for subtitle generation and audio alignment
  • Increases response size and processing time slightly
  • Results include timing in seconds from audio start

Cost Optimization

  • Transcription is billed per 15 seconds of audio
  • Using appropriate models can improve accuracy and reduce costs
  • Consider batching multiple short files together
  • Cache transcription results to avoid re-processing
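
Since billing is per 15-second increment, a rough cost estimate is straightforward. A sketch assuming increments are rounded up (check current Google Cloud pricing; the helper names are ours and any price you plug in is a placeholder):

```javascript
// Number of 15-second billing units for a given audio duration,
// assuming partial increments are rounded up.
function billable15sUnits(durationSeconds) {
  return Math.ceil(durationSeconds / 15);
}

// Rough cost estimate; pricePerUnit must come from current GCP pricing.
function estimateCost(durationSeconds, pricePerUnit) {
  return billable15sUnits(durationSeconds) * pricePerUnit;
}

// Example: a 61-second file bills as five 15-second units.
```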

Output Structure

The Result output contains an array of recognition results:

[
  {
    "alternatives": [
      {
        "transcript": "hello world this is a test",
        "confidence": 0.95,
        "words": [            // Only if word time offsets enabled
          {
            "word": "hello",
            "startTime": "0s",
            "endTime": "0.5s"
          }
          // ... more words
        ]
      }
      // ... alternative transcriptions
    ]
  }
  // ... more results for different segments
]

Access the transcription in your flow:

// Get the first (most confident) transcription
const transcript = $.result[0].alternatives[0].transcript;

// Get confidence score
const confidence = $.result[0].alternatives[0].confidence;

// Access word timings (if enabled)
const words = $.result[0].alternatives[0].words;
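
For longer recordings the result array contains one entry per segment, so the snippets above only cover the first segment. A sketch that stitches the top alternative of every segment into one transcript (the helper name is ours):

```javascript
// Concatenate the top alternative of every result segment into a single
// transcript string, skipping segments with no alternatives.
function fullTranscript(results) {
  return results
    .filter((r) => r.alternatives && r.alternatives.length > 0)
    .map((r) => r.alternatives[0].transcript)
    .join(" ")
    .trim();
}

// In a flow: const text = fullTranscript($.result);
```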

Tips for Best Results

  1. Audio Quality: Higher quality audio produces better transcriptions

    • Reduce background noise
    • Use proper microphone placement
    • Avoid audio compression artifacts
  2. Sample Rate Matching: Always match the node's sample rate to your audio file

    • Check actual audio properties before transcription
    • Don't assume sample rate - verify it
  3. Language Selection: Specify the correct language for best accuracy

    • Use language-specific models when available
    • Consider regional variants (en-US vs. en-GB)
  4. Model Selection: Choose the right model for your audio source

    • Phone call audio should use "Phone Call" model
    • Video audio should use "Video" model
  5. Enable Punctuation: Unless you have specific reasons not to

    • Makes transcripts much more readable
    • Helps with downstream text processing
  6. Use Word Offsets: For subtitle/caption generation

    • Essential for video subtitling
    • Helpful for searchable audio players
  7. Error Handling: Implement retry logic

    • Network issues can cause transient failures
    • Use exponential backoff for retries
  8. Batch Processing: For multiple files

    • Reuse credentials across multiple nodes
    • Process files in parallel when possible
  9. Monitor Costs: Track usage in Google Cloud Console

    • Set up billing alerts
    • Use appropriate models to optimize costs
  10. Test First: Always test with sample audio

    • Verify settings produce good results
    • Check confidence scores are acceptable
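
The retry advice in tip 7 can be sketched as a small wrapper. The operation argument stands in for whatever submits the recognition request; the attempt count and base delay are illustrative defaults:

```javascript
// Retry an async operation with exponential backoff: delays double on
// each failed attempt, and the last error is rethrown once attempts
// are exhausted.
async function withRetry(operation, maxAttempts = 3, baseDelayMs = 1000) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```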

Common Errors and Solutions

"Google Cloud Storage (GCS) URI cannot be empty"

  • Cause: No GCS URI provided in input
  • Solution: Provide a valid gs:// URI pointing to your audio file

"No Credential Content"

  • Cause: Credentials not properly configured
  • Solution: Verify service account JSON is complete and valid

"Failed to recognize audio"

  • Cause: Various API or audio format issues
  • Solution:
    • Check audio file format is supported
    • Verify sample rate matches audio file
    • Ensure service account has file access
    • Check API is enabled in Cloud Console

Low Confidence Scores

  • Cause: Poor audio quality or incorrect settings
  • Solution:
    • Improve audio quality (reduce noise)
    • Verify correct language code
    • Use appropriate recognition model
    • Check sample rate matches audio

"Permission Denied" accessing GCS

  • Cause: Service account lacks storage permissions
  • Solution: Grant "Storage Object Viewer" role to service account

Empty or Missing Transcription

  • Cause: No speech detected in audio
  • Solution:
    • Verify audio file contains speech
    • Check audio isn't corrupted
    • Ensure correct sample rate