Speech to Text
Converts speech audio files stored in Google Cloud Storage to text using the Google Cloud Speech-to-Text API, with advanced recognition features and multiple language support.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- Google Cloud Storage (GCS) URI - The Google Cloud Storage URI of the audio file to transcribe. Must be in the format
gs://bucket_name/path/to/file.wav. The audio file must be accessible by the service account.
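As a quick sanity check before running the node, the URI format can be verified with a simple pattern. This is a sketch only; the node performs its own validation and the bucket-name rules here are simplified:
// Quick format check for a GCS URI (sketch; the node validates this itself).
function isValidGcsUri(uri) {
  // gs:// followed by a bucket name (3-63 chars) and an object path
  return /^gs:\/\/[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]\/.+$/.test(uri);
}

isValidGcsUri('gs://bucket_name/path/to/file.wav');              // true
isValidGcsUri('https://storage.googleapis.com/bucket/file.wav'); // false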
Output
- Result - An array of speech recognition results containing:
- Transcribed text for each segment
- Confidence scores (0.0 to 1.0)
- Alternative transcriptions (if available)
- Word-level timing information (if enabled)
- Channel tags (for multi-channel audio)
Options
Audio Configuration
- Sample Rate - The sample rate (in Hertz) of the audio data. Must match the actual sample rate of the audio file. Default is 16000 Hz. Common values:
- 8000 - Telephone quality
- 16000 - Wideband quality (recommended)
- 32000 - High quality
- 44100 - CD quality
- 48000 - Professional quality
Language and Model
- Language Code - The BCP-47 language code for the audio language. Default is en-US. Examples:
- en-US - English (United States)
- en-GB - English (United Kingdom)
- es-ES - Spanish (Spain)
- fr-FR - French (France)
- de-DE - German (Germany)
- ja-JP - Japanese
- zh-CN - Chinese (Simplified)
- tr-TR - Turkish
- Model - The speech recognition model optimized for different audio sources:
- Default - Best for most scenarios with clear audio
- Video - Optimized for audio from video recordings
- Phone Call - Optimized for audio from phone calls (narrowband)
- Command And Search - Best for short queries, voice commands, or voice search
Recognition Features
- Automatic Punctuation - When enabled, automatically adds punctuation (periods, commas, question marks) to the transcriptions. Improves readability but may not be available for all languages. Default is false.
- Enable Word Time Offsets - When enabled, provides word-level start and end timestamps in the results. Default is false. Useful for:
- Creating subtitles and captions (see the sketch after this list)
- Aligning text with audio playback
- Building interactive transcripts
- Detecting specific words in the timeline
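For subtitle work, the word offsets can be converted into SRT cues directly in a flow. A minimal sketch, assuming word time offsets are enabled and that words holds the word array from the node's Result output (start and end times arrive as strings such as "0.5s", as shown in the Output Structure section below):
// Turn word-level offsets into simple SRT subtitle cues (illustrative sketch).
function toSrtTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, '0');
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, '0');
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, '0');
  const rest = String(ms % 1000).padStart(3, '0');
  return `${h}:${m}:${s},${rest}`;
}

function wordsToSrt(words, wordsPerCue = 8) {
  const cues = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const chunk = words.slice(i, i + wordsPerCue);
    const start = parseFloat(chunk[0].startTime);            // "0s" -> 0
    const end = parseFloat(chunk[chunk.length - 1].endTime); // "0.5s" -> 0.5
    const text = chunk.map(w => w.word).join(' ');
    cues.push(`${cues.length + 1}\n${toSrtTimestamp(start)} --> ${toSrtTimestamp(end)}\n${text}\n`);
  }
  return cues.join('\n');
}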
Authentication
- Credentials - Google Cloud service account credentials in JSON format. The service account must have:
- Cloud Speech-to-Text API enabled
- Storage Object Viewer role (to access GCS files)
- Appropriate permissions for the bucket containing the audio file
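If you need to reproduce the node's behavior outside the automation, the same service-account JSON can be passed to the official Node.js client directly. A minimal sketch, assuming the JSON string is supplied via an environment variable (GCP_CREDENTIALS_JSON is an illustrative name, not something the node reads):
// Passing service-account JSON to the client explicitly (illustrative).
const speech = require('@google-cloud/speech');

const credentialsJson = process.env.GCP_CREDENTIALS_JSON; // full service-account JSON string (illustrative source)
const credentials = JSON.parse(credentialsJson);

const client = new speech.SpeechClient({
  credentials: {
    client_email: credentials.client_email,
    private_key: credentials.private_key,
  },
  projectId: credentials.project_id,
});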
How It Works
The Speech to Text node uses Google Cloud's long-running recognition method to convert audio to text. When executed:
- Validates the GCS URI format and ensures it's not empty
- Authenticates with Google Cloud Speech-to-Text API using provided credentials
- Configures recognition parameters:
- Sets audio encoding to LINEAR16 (WAV format)
- Applies specified sample rate and language code
- Selects the appropriate recognition model
- Enables requested features (punctuation, word offsets)
- Submits long-running recognition request to Google Cloud
- Waits for recognition operation to complete
- Returns detailed transcription results with metadata
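For reference, the sketch below shows roughly what these steps look like when expressed with the official @google-cloud/speech Node.js client. The URI and settings are placeholders, and the node's internal implementation may differ:
// Minimal long-running recognition sketch using the official Node.js client.
const speech = require('@google-cloud/speech');

async function transcribe() {
  const client = new speech.SpeechClient(); // uses application default credentials here

  const request = {
    audio: { uri: 'gs://bucket_name/path/to/file.wav' }, // placeholder URI
    config: {
      encoding: 'LINEAR16',           // WAV / LINEAR16, as the node uses
      sampleRateHertz: 16000,         // must match the audio file
      languageCode: 'en-US',
      model: 'default',
      enableAutomaticPunctuation: true,
      enableWordTimeOffsets: false,
    },
  };

  // Submit the long-running recognition request, then wait for the operation to complete.
  const [operation] = await client.longRunningRecognize(request);
  const [response] = await operation.promise();
  return response.results;
}

transcribe().then(results => console.log(JSON.stringify(results, null, 2)));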
Requirements
- Valid Google Cloud Storage URI (gs://bucket/file.wav)
- Audio file in supported format (WAV/LINEAR16 recommended)
- Google Cloud service account with Speech-to-Text API enabled
- Service account with Storage Object Viewer permissions
- Internet connectivity to Google Cloud services
Error Handling
The node returns specific errors for common issues:
- ErrInvalidArg:
- Empty or missing GCS URI
- Invalid sample rate value
- Invalid language code
- Recognition operation failed
- ErrCredentials:
- Invalid or missing credentials
- Malformed JSON credentials
- Missing credential content
- Insufficient API permissions
Common Google Cloud API errors:
- Invalid GCS URI: Check the gs:// path format and bucket name
- File not found: Verify file exists and service account has access
- Permission denied: Ensure service account has Storage Object Viewer role
- Invalid audio format: Convert audio to supported format
- Sample rate mismatch: Set sample rate to match actual audio
- Language not supported: Use a supported language code
- API not enabled: Enable Speech-to-Text API in Cloud Console
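When the API is called directly with the Node.js client, these failures surface as gRPC status codes. The mapping below is an approximate diagnostic aid, not the node's internal error handling:
// Approximate mapping of common gRPC status codes to likely causes (sketch).
const GRPC_CAUSES = {
  3: 'INVALID_ARGUMENT: bad sample rate, language code, or GCS URI',
  5: 'NOT_FOUND: audio file or bucket does not exist',
  7: 'PERMISSION_DENIED: missing Storage Object Viewer role or API not enabled',
  16: 'UNAUTHENTICATED: invalid or expired credentials',
};

async function recognizeWithDiagnostics(client, request) {
  try {
    const [operation] = await client.longRunningRecognize(request);
    const [response] = await operation.promise();
    return response.results;
  } catch (err) {
    throw new Error(GRPC_CAUSES[err.code] || `Recognition failed: ${err.message}`);
  }
}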
Example Use Cases
Transcribe Customer Service Calls
Input:
GCS URI: gs://call-recordings/customer-call-2024-01-15.wav
Sample Rate: 8000
Language Code: en-US
Model: Phone Call
Automatic Punctuation: true
Use Case: Convert recorded phone conversations to searchable text for quality
assurance and customer insights.
Generate Video Subtitles
Input:
GCS URI: gs://video-audio/training-video-audio.wav
Sample Rate: 48000
Language Code: en-US
Model: Video
Enable Word Time Offsets: true
Automatic Punctuation: true
Use Case: Create accurate subtitles with precise timing for training videos
and educational content.
Transcribe Meeting Recordings
Input:
GCS URI: gs://meetings/board-meeting-2024-q1.wav
Sample Rate: 16000
Language Code: en-US
Model: Default
Automatic Punctuation: true
Use Case: Convert meeting recordings to text for minutes, action items,
and documentation.
Voice Command Processing
Input:
GCS URI: gs://voice-commands/user-query-12345.wav
Sample Rate: 16000
Language Code: en-US
Model: Command And Search
Use Case: Quickly transcribe short voice commands or search queries from
voice-enabled applications.
Multilingual Content Transcription
Input:
GCS URI: gs://podcasts/spanish-episode-15.wav
Sample Rate: 44100
Language Code: es-ES
Model: Default
Automatic Punctuation: true
Use Case: Create transcripts for podcasts and audio content in multiple languages.
Usage Notes
Audio File Requirements
- Audio must be stored in Google Cloud Storage (not local files)
- Supported formats: WAV (LINEAR16), FLAC, MP3, OGG Opus, AMR, SPEEX
- The node uses LINEAR16 encoding, so WAV format is recommended
- Maximum audio duration: 480 minutes (8 hours)
- For audio over 1 minute, long-running recognition is used automatically
Sample Rate Guidelines
- The sample rate MUST match your audio file's actual sample rate
- Mismatched sample rates result in poor transcription quality
- Use ffprobe or a similar tool to check the actual sample rate, or read it from the WAV header as in the sketch below
- Common mistake: setting 16000 Hz for 8000 Hz phone call audio
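A minimal sketch of reading the sample rate straight from a WAV header. This assumes a canonical PCM WAV file, where the rate sits at byte offset 24 as a little-endian 32-bit integer; the file name is illustrative, and ffprobe remains the safer choice for compressed or non-canonical files:
// Read the sample rate from a canonical PCM WAV header (bytes 24-27, little-endian).
const fs = require('fs');

function wavSampleRate(path) {
  const buf = Buffer.alloc(28);
  const fd = fs.openSync(path, 'r');
  fs.readSync(fd, buf, 0, 28, 0);
  fs.closeSync(fd);
  return buf.readUInt32LE(24);
}

console.log(wavSampleRate('call-recording.wav')); // e.g. 8000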
Language Code Selection
- Always specify the correct language for best accuracy
- Language codes are case-sensitive (use en-US, not en-us)
- Some languages support multiple variants (en-US, en-GB, en-AU)
- Automatic language detection is not available - you must specify the language code explicitly
Model Selection Best Practices
- Default: Use for general audio with clear speech
- Video: Use for audio extracted from video files
- Phone Call: Use for telephony audio (8kHz, narrowband)
- Command And Search: Use for short utterances under 30 seconds
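For reference, these options correspond to lowercase model identifiers in the Speech-to-Text API. The mapping below is a sketch of the likely values; the node sets this for you based on the selected option:
// Likely mapping from the node's Model options to Speech-to-Text API model names (sketch).
const MODEL_MAP = {
  'Default': 'default',
  'Video': 'video',
  'Phone Call': 'phone_call',
  'Command And Search': 'command_and_search',
};

// Example: MODEL_MAP['Phone Call'] -> 'phone_call' is what goes into config.model.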
Automatic Punctuation
- Improves readability of transcripts significantly
- Not available for all languages (check Google Cloud docs)
- May slightly increase processing time
- Recommended for most use cases
Word Time Offsets
- Adds start and end time for each word in the transcript
- Useful for subtitle generation and audio alignment
- Increases response size and processing time slightly
- Results include timing in seconds from audio start
Cost Optimization
- Transcription is billed per 15 seconds of audio
- Using appropriate models can improve accuracy and reduce costs
- Consider batching multiple short files together
- Cache transcription results to avoid re-processing
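A rough way to estimate billable units for a batch of files, based on the 15-second billing increment noted above. The per-unit price is a placeholder; check current Google Cloud pricing:
// Rough cost estimate: Speech-to-Text bills in 15-second increments, rounded up per request.
const PRICE_PER_15_SEC = 0.006; // USD, placeholder - check current pricing

function estimateCost(durationsInSeconds) {
  const billableUnits = durationsInSeconds
    .map(d => Math.ceil(d / 15))
    .reduce((a, b) => a + b, 0);
  return billableUnits * PRICE_PER_15_SEC;
}

console.log(estimateCost([42, 130, 8]).toFixed(3)); // 3 + 9 + 1 = 13 units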
Output Structure
The Result output contains an array of recognition results:
[
{
"alternatives": [
{
"transcript": "hello world this is a test",
"confidence": 0.95,
"words": [ // Only if word time offsets enabled
{
"word": "hello",
"startTime": "0s",
"endTime": "0.5s"
},
// ... more words
]
}
// ... alternative transcriptions
]
}
// ... more results for different segments
]
Access the transcription in your flow:
// Get the first (most confident) transcription
const transcript = $.result[0].alternatives[0].transcript;
// Get confidence score
const confidence = $.result[0].alternatives[0].confidence;
// Access word timings (if enabled)
const words = $.result[0].alternatives[0].words;
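Longer audio usually produces several result segments. To build one continuous transcript, join the top alternative of each segment (a sketch using the same $.result shape as above):
// Join the top alternative from every segment into one transcript (sketch).
const fullTranscript = $.result
  .map(r => r.alternatives[0].transcript)
  .join(' ')
  .trim();

// Average confidence across segments, as a quick quality check.
const avgConfidence =
  $.result.reduce((sum, r) => sum + r.alternatives[0].confidence, 0) / $.result.length;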
Tips for Best Results
- Audio Quality: Higher quality audio produces better transcriptions
- Reduce background noise
- Use proper microphone placement
- Avoid audio compression artifacts
- Sample Rate Matching: Always match the node's sample rate to your audio file
- Check actual audio properties before transcription
- Don't assume the sample rate - verify it
- Language Selection: Specify the correct language for best accuracy
- Use language-specific models when available
- Consider regional variants (en-US vs. en-GB)
- Model Selection: Choose the right model for your audio source
- Phone call audio should use the "Phone Call" model
- Video audio should use the "Video" model
- Enable Punctuation: Turn it on unless you have a specific reason not to
- Makes transcripts much more readable
- Helps with downstream text processing
- Use Word Offsets: Enable for subtitle and caption generation
- Essential for video subtitling
- Helpful for searchable audio players
- Error Handling: Implement retry logic (see the sketch after this list)
- Network issues can cause transient failures
- Use exponential backoff for retries
- Batch Processing: For multiple files
- Reuse credentials across multiple nodes
- Process files in parallel when possible
- Monitor Costs: Track usage in the Google Cloud Console
- Set up billing alerts
- Use appropriate models to optimize costs
- Test First: Always test with sample audio
- Verify settings produce good results
- Check that confidence scores are acceptable
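A minimal sketch of the retry-with-exponential-backoff approach mentioned in the error-handling tip above. The recognize call in the usage comment refers to the illustrative helper from the Error Handling section, not a built-in function:
// Simple retry with exponential backoff for transient failures (sketch).
async function withRetry(fn, maxAttempts = 4) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      const delayMs = 1000 * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Usage (illustrative): retry the long-running recognition helper shown earlier.
// const results = await withRetry(() => recognizeWithDiagnostics(client, request));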
Common Errors and Solutions
"Google Cloud Storage (GCS) URI cannot be empty"
- Cause: No GCS URI provided in input
- Solution: Provide a valid gs:// URI pointing to your audio file
"No Credential Content"
- Cause: Credentials not properly configured
- Solution: Verify service account JSON is complete and valid
"Failed to recognize audio"
- Cause: Various API or audio format issues
- Solution:
- Check audio file format is supported
- Verify sample rate matches audio file
- Ensure service account has file access
- Check API is enabled in Cloud Console
Low Confidence Scores
- Cause: Poor audio quality or incorrect settings
- Solution:
- Improve audio quality (reduce noise)
- Verify correct language code
- Use appropriate recognition model
- Check sample rate matches audio
"Permission Denied" accessing GCS
- Cause: Service account lacks storage permissions
- Solution: Grant "Storage Object Viewer" role to service account
Empty or Missing Transcription
- Cause: No speech detected in audio
- Solution:
- Verify audio file contains speech
- Check audio isn't corrupted
- Ensure correct sample rate