Google Speech
The Google Speech package provides cloud-based speech recognition and synthesis, so you can convert speech to text and text to speech in your RPA workflows using Google Cloud's speech models.
Overview
Google Cloud Speech services provide machine learning models for audio processing. This package integrates both Speech-to-Text (STT) and Text-to-Speech (TTS), allowing you to build voice-enabled automation workflows.
Key Features
Speech Recognition
- Convert audio files to text with high accuracy
- Support for multiple languages and dialects
- Automatic punctuation and formatting
- Word-level timestamps for precise audio alignment
- Multiple recognition models optimized for different audio sources
- Long-running recognition for audio files up to 480 minutes long
Speech Synthesis
- Generate natural-sounding speech from text
- Wide selection of voices across multiple languages
- Multiple audio formats (WAV, MP3, OGG Opus)
- SSML support for advanced speech control
- Customizable voice parameters (pitch, speed, volume)
- Studio-quality voices with neural network technology
Authentication
Both nodes require Google Cloud service account credentials with the appropriate API permissions:
- Speech-to-Text API: For audio transcription
- Text-to-Speech API: For speech synthesis
Your service account must have the necessary roles (e.g., "Cloud Speech Client") and the APIs must be enabled in your Google Cloud project.
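Before adding a key file to the nodes, it can help to confirm that it at least looks like a service account key. The sketch below is a structural check only. The field names are those found in keys downloaded from the Cloud Console; the function name `check_service_account_key` is hypothetical and is not part of Robomotion or any Google library. It does not contact Google, so it cannot tell whether the key is still active or has the right roles.

```python
import json

# Fields present in a service account key downloaded from the Cloud Console.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw_json: str) -> list[str]:
    """Return a list of problems found in the key text (empty if none)."""
    try:
        key = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]

    # Report every required field that is absent or empty.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)]
    # A present but different type usually means a user credential was pasted.
    if key.get("type") not in (None, "", "service_account"):
        problems.append(f"unexpected type: {key['type']!r}")
    return problems
```

An empty list means the file has the expected shape; permission and API-enablement problems only show up when a node actually calls the service.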
Common Use Cases
Customer Service Automation
- Transcribe customer call recordings for analysis
- Generate automated voice responses and notifications
- Create voice-based IVR systems
- Process voicemail messages automatically
Content Creation
- Generate voiceovers for videos and presentations
- Create audio versions of written content
- Produce multilingual audio content
- Build accessible applications with voice output
Data Processing
- Extract data from voice recordings and interviews
- Transcribe meeting recordings for documentation
- Process audio surveys and feedback
- Create searchable transcripts from audio archives
Accessibility
- Convert written content to speech for visually impaired users
- Transcribe audio content for hearing-impaired users
- Build voice-controlled automation workflows
- Create multilingual voice interfaces
Available Nodes
Speech Recognition
- Speech to Text: Convert audio files from Google Cloud Storage to text with advanced recognition features
Speech Synthesis
- Text to Speech: Convert text or SSML to natural-sounding speech audio files
Getting Started
- Set up Google Cloud:
  - Create a Google Cloud project
  - Enable Speech-to-Text and/or Text-to-Speech APIs
  - Create a service account with appropriate permissions
  - Download the service account JSON key file
- Configure Credentials:
  - Add your service account credentials to Robomotion
  - Use the credentials in the Speech nodes
- For Speech-to-Text:
  - Upload your audio file to Google Cloud Storage
  - Use the Speech to Text node with the GCS URI
  - Receive transcription results with confidence scores
- For Text-to-Speech:
  - Prepare your text content (plain or SSML)
  - Select voice and audio format
  - Generate speech and save to file
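To make the Speech-to-Text steps above more concrete, the sketch below builds a request body in the shape used by Google's public Speech-to-Text v1 REST API (`config` plus `audio.uri`). It illustrates what a recognition request contains; it is not a description of how the Robomotion node is implemented. The function name and the bucket path are placeholders.

```python
import json


def build_recognize_request(gcs_uri: str, language_code: str = "en-US",
                            sample_rate_hz: int = 16000) -> dict:
    """Build a JSON body shaped like the public Speech-to-Text v1 REST API."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError("audio must be referenced by a gs:// URI")
    return {
        "config": {
            "encoding": "LINEAR16",              # 16-bit PCM WAV
            "sampleRateHertz": sample_rate_hz,   # must match the actual file
            "languageCode": language_code,       # BCP-47 code
            "enableAutomaticPunctuation": True,
            "enableWordTimeOffsets": True,       # word-level timestamps
        },
        "audio": {"uri": gcs_uri},
    }


body = build_recognize_request("gs://example-bucket/call.wav")  # placeholder URI
print(json.dumps(body, indent=2))
```

In a flow, the node's own properties take the place of this dictionary; the sketch is only meant to show which settings travel together in one request.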
Supported Languages
Google Speech supports 100+ languages and variants including:
- English (US, UK, AU, IN, etc.)
- Spanish (ES, MX, US, etc.)
- French (FR, CA)
- German (DE)
- Chinese (Mandarin, Cantonese)
- Japanese
- Korean
- Arabic
- Hindi
- Portuguese (BR, PT)
- And many more...
See the Google Cloud Speech documentation for the complete list of supported languages.
Audio Format Requirements
Speech-to-Text
- Audio must be stored in Google Cloud Storage
- Supported formats: WAV (LINEAR16), FLAC, MP3, OGG Opus, AMR, SPEEX
- Recommended: WAV with a 16 kHz sample rate for best results
- Maximum audio length: 480 minutes
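A quick way to check a WAV file against these requirements before uploading it is to read its header. The following is a minimal sketch using only Python's standard `wave` module, which handles uncompressed PCM WAV files; the helper names `describe_wav` and `make_silent_wav` are illustrative, and the silent clip exists only so the example runs without a real recording.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read sample rate, channel count and duration from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Create a mono 16-bit silent WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


info = describe_wav(make_silent_wav(2.0))
print(info)
```

The reported sample rate is the value to configure for recognition, and the duration can be compared with the 480-minute limit.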
Text-to-Speech
- Output formats: WAV (LINEAR16), MP3, OGG Opus
- Configurable sample rates (8 kHz to 48 kHz)
- Default: 16 kHz, a balance between quality and file size
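The output options above map onto a small set of request fields. As a hedged sketch, the code below builds a body in the shape of Google's public Text-to-Speech v1 REST API (`input`, `voice`, `audioConfig`) and applies the format list and sample-rate range given on this page. The function name and the `ENCODINGS` table are illustrative, not part of the Robomotion package.

```python
# Map the format names used on this page to AudioEncoding values of the
# public Text-to-Speech v1 REST API.
ENCODINGS = {"WAV": "LINEAR16", "MP3": "MP3", "OGG Opus": "OGG_OPUS"}


def build_synthesize_request(text: str, audio_format: str = "MP3",
                             language_code: str = "en-US",
                             sample_rate_hz: int = 16000,
                             is_ssml: bool = False) -> dict:
    """Build a synthesis request body, validating format and sample rate."""
    if audio_format not in ENCODINGS:
        raise ValueError(f"unsupported format: {audio_format}")
    if not 8000 <= sample_rate_hz <= 48000:
        raise ValueError("sample rate must be between 8000 and 48000 Hz")
    return {
        # The API accepts either plain text or SSML, never both.
        "input": {"ssml" if is_ssml else "text": text},
        "voice": {"languageCode": language_code},
        "audioConfig": {
            "audioEncoding": ENCODINGS[audio_format],
            "sampleRateHertz": sample_rate_hz,
        },
    }


request = build_synthesize_request("Your order has shipped.", audio_format="WAV")
```

Rejecting an out-of-range sample rate locally gives a clearer message than waiting for the service to refuse the request.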
Pricing Considerations
Google Cloud Speech services are billed based on usage:
- Speech-to-Text: Charged per 15 seconds of audio processed
- Text-to-Speech: Charged per 1 million characters processed
Different pricing tiers apply for:
- Standard vs. premium voices (TTS)
- Different recognition models (STT)
- Enhanced models and features
See Google Cloud Pricing for detailed information.
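For rough budgeting, the billing units can be counted before a batch is run. The sketch below takes the 15-second unit quoted above at face value and rounds each file up to whole units; the rounding rule and the per-unit price are assumptions to confirm against the current pricing page, so no price is built into the code.

```python
import math


def billable_stt_units(duration_s: float, unit_s: int = 15) -> int:
    """Number of billing units for one audio file, rounding up."""
    if duration_s <= 0:
        return 0
    return math.ceil(duration_s / unit_s)


def estimate_cost(units: int, price_per_unit: float) -> float:
    """Multiply units by a price you look up yourself; none is assumed here."""
    return units * price_per_unit


print(billable_stt_units(61))  # 61 s spans 5 units of 15 s
```

Summing `billable_stt_units` over all files in a batch gives an upper-level figure to compare with your project's budget alerts.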
Best Practices
For Speech Recognition
- Use appropriate audio sample rates (16 kHz recommended)
- Store audio files in Google Cloud Storage; the Speech to Text node reads audio from a gs:// URI
- Select the correct recognition model for your audio source
- Enable automatic punctuation for better readability
- Use language-specific models for improved accuracy
- Consider using enhanced models for noisy audio
For Speech Synthesis
- Use SSML for fine-grained control over pronunciation and timing
- Select voices that match your target audience and use case
- Choose appropriate audio formats based on quality vs. size needs
- Test different voices to find the most natural-sounding option
- Use Studio voices for highest quality output
- Cache generated audio files to reduce API calls
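The caching advice above can be sketched as a small file cache keyed by everything that affects the audio. This is one possible approach, not a feature of the package: `cached_synthesize` and its `synthesize` callback are hypothetical names, and the callback stands in for whatever step actually produces the audio bytes in your flow.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(text: str, voice: str, audio_format: str) -> str:
    """Stable key derived from everything that affects the audio output."""
    payload = "\x1f".join((voice, audio_format, text)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_synthesize(text: str, voice: str, audio_format: str,
                      cache_dir: Path,
                      synthesize: Callable[[str, str, str], bytes]) -> Path:
    """Return the cached file if present; otherwise call `synthesize` once."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    name = f"{cache_key(text, voice, audio_format)}.{audio_format.lower()}"
    path = cache_dir / name
    if not path.exists():
        path.write_bytes(synthesize(text, voice, audio_format))
    return path
```

Because the voice and format are part of the key, changing either produces a new file instead of silently reusing stale audio.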
General
- Implement proper error handling for network issues
- Monitor API usage and costs
- Use regional endpoints when possible for lower latency
- Validate credentials before processing large batches
- Consider rate limits and quotas in production workflows
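For the error-handling point above, a common pattern is to retry transient failures with exponential backoff. The sketch below is generic Python under stated assumptions: `call_with_retry` is an illustrative helper, and the exception types treated as transient are placeholders to adapt to whatever errors your flow actually surfaces.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(func: Callable[[], T], attempts: int = 4,
                    base_delay_s: float = 1.0,
                    retry_on: tuple = (ConnectionError, TimeoutError),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `func`, retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait 1 s, 2 s, 4 s, ... plus a little jitter so parallel
            # flows do not all retry at the same instant.
            sleep(base_delay_s * 2 ** attempt + random.uniform(0, 0.1))
    raise ValueError("attempts must be at least 1")
```

Errors that are not transient, such as invalid credentials or an unsupported audio format, are deliberately left out of `retry_on` so they fail immediately.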
Tips for Effective Use
- Optimize Audio Quality: Higher quality input audio produces better transcription results
- Use Appropriate Models: Select models based on your audio source (video, phone, default)
- Language Selection: Always specify the correct language code for best accuracy
- Batch Processing: Process multiple files efficiently by reusing credentials
- Error Recovery: Implement retry logic for transient network errors
- Voice Selection: Test multiple voices to find the best fit for your content
- SSML Formatting: Use SSML tags to control emphasis, pauses, and pronunciation
- File Management: Clean up temporary audio files after processing
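As an illustration of the SSML tip above, the sketch below wraps sentences in a `<speak>` element and inserts a `<break>` after each one. `<speak>` and `<break time="...">` are standard SSML elements; the helper name `to_ssml` and the default pause length are arbitrary choices for the example.

```python
from xml.sax.saxutils import escape


def to_ssml(sentences: list[str], pause_ms: int = 400) -> str:
    """Wrap sentences in <speak>, with a pause after each one."""
    if pause_ms < 0:
        raise ValueError("pause_ms must not be negative")
    parts = []
    for sentence in sentences:
        # &, < and > in the text must be escaped to keep the SSML well-formed.
        parts.append(f'{escape(sentence)}<break time="{pause_ms}ms"/>')
    return "<speak>" + "".join(parts) + "</speak>"


print(to_ssml(["Tom & Jerry", "Done."], pause_ms=300))
```

Escaping matters in practice: an unescaped ampersand in customer-supplied text is enough to make the whole SSML document invalid.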
Limitations and Quotas
Speech-to-Text
- Maximum audio length: 480 minutes (8 hours)
- Long-running recognition is required for audio longer than 1 minute
- Synchronous recognition is limited to 1 minute of audio
- Rate limits apply per project
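The duration limits above amount to a simple decision rule, sketched below. The thresholds come from this page; the function name is illustrative, and whether the Speech to Text node makes this choice internally is not stated here.

```python
SYNC_LIMIT_S = 60         # synchronous recognition: up to 1 minute
MAX_AUDIO_S = 480 * 60    # long-running recognition: up to 480 minutes


def recognition_mode(duration_s: float) -> str:
    """Pick the recognition mode that fits an audio duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s > MAX_AUDIO_S:
        raise ValueError("audio exceeds the 480-minute limit; split it first")
    return "synchronous" if duration_s <= SYNC_LIMIT_S else "long-running"


print(recognition_mode(45))    # synchronous
print(recognition_mode(3600))  # long-running
```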
Text-to-Speech
- Maximum input text: 5,000 characters per request
- Maximum SSML length: 5,000 characters
- Rate limits apply per project
- Some voices have usage restrictions
Common Errors and Solutions
Authentication Errors
- Invalid credentials: Verify JSON key file is complete and valid
- Permission denied: Ensure service account has required API roles
- API not enabled: Enable Speech-to-Text/Text-to-Speech API in Cloud Console
Speech-to-Text Errors
- Invalid GCS URI: Verify the gs:// path is correct and accessible
- Unsupported audio format: Convert audio to supported format
- Sample rate mismatch: Ensure configured rate matches actual audio
- File not found: Verify the object exists and the service account has the Storage Object Viewer role
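Many "Invalid GCS URI" errors are plain typos that can be caught before the node runs. The sketch below is a deliberately loose format check: the pattern is an approximation (real bucket naming has additional rules), `parse_gcs_uri` is an illustrative name, and the check does not verify that the object exists or is readable.

```python
import re

# Loose pattern: gs://<bucket>/<object>. Buckets use lowercase letters,
# digits, dots, dashes and underscores, and start and end with a letter
# or digit. This only catches obvious mistakes.
_GCS_URI = re.compile(r"^gs://([a-z0-9][a-z0-9._-]*[a-z0-9])/(.+)$")


def parse_gcs_uri(uri: str) -> tuple[str, str]:
    """Split a gs:// URI into (bucket, object_name) or raise ValueError."""
    match = _GCS_URI.match(uri.strip())
    if match is None:
        raise ValueError(f"not a gs://bucket/object URI: {uri!r}")
    return match.group(1), match.group(2)


print(parse_gcs_uri("gs://example-bucket/audio/call.wav"))
```

A URI that passes this check can still fail at run time if the service account lacks read access to the bucket.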
Text-to-Speech Errors
- Invalid voice name: Check voice is available for selected language
- Text too long: Split large texts into smaller chunks
- Invalid language code: Use correct BCP-47 language codes
- File write error: Verify output path is writable
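For the "Text too long" case, splitting at sentence boundaries keeps each chunk natural to listen to. The sketch below is one simple way to do it, assuming the 5,000-character limit quoted on this page; `split_text` is an illustrative helper. If the limit turns out to be measured in bytes for your content, compare `len(chunk.encode("utf-8"))` instead of the character count.

```python
import re


def split_text(text: str, limit: int = 5000) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Breaks fall at sentence ends where possible; a single sentence
    longer than `limit` is cut hard.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        while len(sentence) > limit:      # oversized sentence: hard cut
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate           # sentence still fits in this chunk
        else:
            chunks.append(current)        # close the chunk, start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", limit=10))
```

Each chunk can then be synthesized separately and the resulting audio files played or concatenated in order.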
Resources
- Google Cloud Speech-to-Text Documentation
- Google Cloud Text-to-Speech Documentation
- Supported Languages
- Available Voices
- SSML Reference
- Pricing Calculator
Speech to Text: Robomotion.GoogleSpeech.SpeechToText
Text to Speech: Robomotion.GoogleSpeech.TextToSpeech