Google Speech

The Google Speech package provides powerful cloud-based speech recognition and synthesis capabilities, enabling you to convert speech to text and text to speech in your RPA workflows using Google Cloud's advanced AI models.

Overview

Google Cloud Speech services offer state-of-the-art machine learning models for audio processing. This package integrates both Speech-to-Text (STT) and Text-to-Speech (TTS) capabilities, allowing you to build sophisticated voice-enabled automation workflows.

Key Features

Speech Recognition

Convert audio files to text with high accuracy
Support for multiple languages and dialects
Automatic punctuation and formatting
Word-level timestamps for precise audio alignment
Multiple recognition models optimized for different audio sources
Long-running recognition for audio files up to 480 minutes

Speech Synthesis

Generate natural-sounding speech from text
Wide selection of voices across multiple languages
Multiple audio formats (WAV, MP3, OGG Opus)
SSML support for advanced speech control
Customizable voice parameters (pitch, speed, volume)
Studio-quality voices with neural network technology

Authentication

Both nodes require Google Cloud service account credentials with the appropriate API permissions:

Speech-to-Text API: For audio transcription
Text-to-Speech API: For speech synthesis

Your service account must have the necessary roles (e.g., "Cloud Speech Client"), and the corresponding APIs must be enabled in your Google Cloud project.
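Before adding a key to Robomotion, it can help to confirm that the downloaded JSON file is complete. The sketch below checks for fields that a standard Google Cloud service account key contains. The function name `check_service_account_key` is hypothetical, and the check only inspects the file's shape; it does not verify roles or whether the APIs are enabled.

```python
import json

# Fields present in a standard Google Cloud service account key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw_json: str) -> list[str]:
    """Return a list of problems found in the key; an empty list means it looks usable."""
    try:
        key = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not key.get(name)]
    # Only service account keys are expected here, not user or API-key credentials.
    if key.get("type") not in (None, "", "service_account"):
        problems.append(f"unexpected type: {key['type']!r}")
    return problems
```

An empty result only means the file is structurally plausible; a "Permission denied" error can still occur if the roles are missing.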

Common Use Cases

Customer Service Automation

Transcribe customer call recordings for analysis
Generate automated voice responses and notifications
Create voice-based IVR systems
Process voicemail messages automatically

Content Creation

Generate voiceovers for videos and presentations
Create audio versions of written content
Produce multilingual audio content
Build accessible applications with voice output

Data Processing

Extract data from voice recordings and interviews
Transcribe meeting recordings for documentation
Process audio surveys and feedback
Create searchable transcripts from audio archives

Accessibility

Convert written content to speech for visually impaired users
Transcribe audio content for hearing impaired users
Build voice-controlled automation workflows
Create multilingual voice interfaces

Available Nodes

Speech Recognition

Speech to Text: Convert audio files from Google Cloud Storage to text with advanced recognition features
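For orientation, the features listed above map onto fields of the request body used by the Google Cloud Speech-to-Text v1 REST API. The sketch below builds such a body. It is an illustration of the underlying API shape, not a description of how the Robomotion node is implemented; the function name and the bucket path are hypothetical.

```python
import json


def build_stt_request(gcs_uri: str, language_code: str = "en-US",
                      model: str = "default", sample_rate_hz: int = 16000) -> dict:
    """Build a request body in the shape used by the Speech-to-Text v1 REST API."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError("audio must be referenced by a gs:// URI")
    return {
        "config": {
            "encoding": "LINEAR16",              # 16-bit PCM, e.g. a WAV file
            "sampleRateHertz": sample_rate_hz,   # must match the actual audio
            "languageCode": language_code,       # BCP-47 code
            "model": model,                      # e.g. "default", "video", "phone_call"
            "enableAutomaticPunctuation": True,
            "enableWordTimeOffsets": True,       # word-level timestamps
        },
        "audio": {"uri": gcs_uri},
    }


# Hypothetical bucket and object name, for illustration only.
body = build_stt_request("gs://example-bucket/calls/call-001.wav", model="phone_call")
print(json.dumps(body, indent=2))
```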

Speech Synthesis

Text to Speech: Convert text or SSML to natural-sounding speech audio files
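As a rough picture of what a synthesis request contains, the sketch below builds a body in the shape used by the Google Cloud Text-to-Speech v1 REST API. The function name and the rule for detecting SSML (input starting with `<speak`) are assumptions made for this example; the node's own options may be named differently.

```python
def build_tts_request(content: str, language_code: str = "en-US",
                      voice_name: str = "", audio_encoding: str = "MP3",
                      speaking_rate: float = 1.0, pitch: float = 0.0,
                      volume_gain_db: float = 0.0) -> dict:
    """Build a request body in the shape used by the Text-to-Speech v1 REST API."""
    # Assumption: input that starts with <speak is SSML, anything else is plain text.
    input_key = "ssml" if content.lstrip().startswith("<speak") else "text"
    voice = {"languageCode": language_code}
    if voice_name:
        voice["name"] = voice_name  # omit to let the service pick a voice for the language
    return {
        "input": {input_key: content},
        "voice": voice,
        "audioConfig": {
            "audioEncoding": audio_encoding,  # "LINEAR16", "MP3", or "OGG_OPUS"
            "speakingRate": speaking_rate,
            "pitch": pitch,
            "volumeGainDb": volume_gain_db,
        },
    }


print(build_tts_request("<speak>Hello <break time='300ms'/> world</speak>"))
```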

Getting Started

Set up Google Cloud:
- Create a Google Cloud project
- Enable Speech-to-Text and/or Text-to-Speech APIs
- Create a service account with appropriate permissions
- Download the service account JSON key file
Configure Credentials:
- Add your service account credentials to Robomotion
- Use the credentials in the Speech nodes
For Speech-to-Text:
- Upload your audio file to Google Cloud Storage
- Use the Speech to Text node with the GCS URI
- Receive transcription results with confidence scores
For Text-to-Speech:
- Prepare your text content (plain or SSML)
- Select voice and audio format
- Generate speech and save to file

Supported Languages

Google Speech supports 100+ languages and variants including:

English (US, UK, AU, IN, etc.)
Spanish (ES, MX, US, etc.)
French (FR, CA)
German (DE)
Chinese (Mandarin, Cantonese)
Japanese
Korean
Arabic
Hindi
Portuguese (BR, PT)
And many more...

See the Google Cloud Speech documentation for the complete list of supported languages.

Audio Format Requirements

Speech-to-Text

Audio must be stored in Google Cloud Storage
Supported formats: WAV (LINEAR16), FLAC, MP3, OGG Opus, AMR, SPEEX
Recommended: WAV with 16kHz sample rate for best results
Maximum audio length: 480 minutes
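A local WAV file can be checked against these requirements before it is uploaded. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files only; the helper names `describe_wav` and `check_for_stt` are hypothetical.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read sample rate, channel count, sample width, and duration from a WAV header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_minutes": frames / rate / 60,
        }


def check_for_stt(info: dict, max_minutes: float = 480) -> list[str]:
    """Return warnings for a file that may not suit the requirements listed above."""
    warnings = []
    if info["bits_per_sample"] != 16:
        warnings.append("LINEAR16 expects 16-bit samples")
    if info["sample_rate_hz"] != 16000:
        warnings.append("16 kHz is the recommended sample rate")
    if info["duration_minutes"] > max_minutes:
        warnings.append("audio exceeds the maximum length")
    return warnings
```

The sample rate reported here is also the value to configure in the node, which avoids the "sample rate mismatch" error.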

Text-to-Speech

Output formats: WAV (LINEAR16), MP3, OGG Opus
Configurable sample rates (8kHz to 48kHz)
Default: 16kHz for optimal quality and file size
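The trade-off between sample rate and file size is simple arithmetic for uncompressed output. The sketch below estimates the size of LINEAR16 audio (16-bit samples), ignoring the small WAV header; the function name is hypothetical, and MP3 or OGG Opus output will be considerably smaller.

```python
def linear16_bytes(duration_seconds: float, sample_rate_hz: int = 16000, channels: int = 1) -> int:
    """Approximate size of uncompressed 16-bit PCM audio, ignoring the WAV header."""
    bytes_per_sample = 2  # LINEAR16 uses 16-bit samples
    return int(duration_seconds * sample_rate_hz * channels * bytes_per_sample)


# Size of one minute of mono audio at several sample rates.
for rate in (8000, 16000, 48000):
    print(rate, linear16_bytes(60, rate))
```

One minute of mono audio comes to roughly 1.9 MB at 16 kHz and 5.8 MB at 48 kHz.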

Pricing Considerations

Google Cloud Speech services are billed based on usage:

Speech-to-Text: Charged per 15 seconds of audio processed
Text-to-Speech: Charged per 1 million characters processed

Different pricing tiers apply for:

Standard vs. premium voices (TTS)
Different recognition models (STT)
Enhanced models and features

See Google Cloud Pricing for detailed information.
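A rough cost estimate can be derived from the two billing units above. The sketch below assumes that audio duration is rounded up to the next 15-second increment, and it takes the rates as parameters because actual prices vary by model, voice type, and date. The function names are hypothetical and the rates in the usage line are placeholders, not real prices.

```python
import math


def stt_billable_seconds(duration_seconds: float, increment: int = 15) -> int:
    """Round an audio duration up to the next billing increment (assumed behavior)."""
    return math.ceil(duration_seconds / increment) * increment


def estimate_cost(stt_seconds: float, tts_characters: int,
                  stt_rate_per_increment: float, tts_rate_per_million: float) -> float:
    """Combine an STT cost (per 15-second increment) and a TTS cost (per 1M characters)."""
    increments = stt_billable_seconds(stt_seconds) // 15
    stt_cost = increments * stt_rate_per_increment
    tts_cost = tts_characters / 1_000_000 * tts_rate_per_million
    return stt_cost + tts_cost


# Placeholder rates for illustration only; look up current prices before budgeting.
print(estimate_cost(stt_seconds=61, tts_characters=250_000,
                    stt_rate_per_increment=0.01, tts_rate_per_million=4.0))
```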

Best Practices

For Speech Recognition

Use appropriate audio sample rates (16kHz recommended)
Store audio files in Google Cloud Storage for best performance
Select the correct recognition model for your audio source
Enable automatic punctuation for better readability
Use language-specific models for improved accuracy
Consider using enhanced models for noisy audio

For Speech Synthesis

Use SSML for fine-grained control over pronunciation and timing
Select voices that match your target audience and use case
Choose appropriate audio formats based on quality vs. size needs
Test different voices to find the most natural-sounding option
Use Studio voices for highest quality output
Cache generated audio files to reduce API calls
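Caching works when the cache key covers everything that changes the generated audio. The sketch below derives a file name from a hash of the text, voice, and format, and only calls the synthesis step on a cache miss. `synthesize` is a hypothetical callable standing in for whatever produces the audio bytes in your workflow.

```python
import hashlib
import json
from pathlib import Path


def cache_path(cache_dir: str, text: str, voice: str, audio_format: str) -> Path:
    """Derive a stable file path from everything that affects the generated audio."""
    key = json.dumps({"text": text, "voice": voice, "format": audio_format}, sort_keys=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.{audio_format.lower()}"


def get_or_synthesize(cache_dir: str, text: str, voice: str, audio_format: str, synthesize) -> Path:
    """Return the cached file if present; otherwise call synthesize() and store its bytes."""
    path = cache_path(cache_dir, text, voice, audio_format)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(synthesize(text, voice, audio_format))
    return path
```

If pitch, speaking rate, or other parameters vary between runs, they would need to be added to the key as well.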

General

Implement proper error handling for network issues
Monitor API usage and costs
Use regional endpoints when possible for lower latency
Validate credentials before processing large batches
Consider rate limits and quotas in production workflows
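Error handling for network issues and quota limits is commonly done with retries and exponential backoff. The sketch below shows the general pattern. `TransientError` is a hypothetical exception type; in a real workflow you would map timeouts or rate-limit responses onto it, and the delays shown are arbitrary defaults.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, rate limiting)."""


def call_with_retry(func, max_attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call func(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # The base delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

Errors such as invalid credentials or an unsupported audio format are not transient and should fail immediately rather than be retried.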

Tips for Effective Use

Optimize Audio Quality: Higher quality input audio produces better transcription results
Use Appropriate Models: Select models based on your audio source (video, phone, default)
Language Selection: Always specify the correct language code for best accuracy
Batch Processing: Process multiple files efficiently by reusing credentials
Error Recovery: Implement retry logic for transient network errors
Voice Selection: Test multiple voices to find the best fit for your content
SSML Formatting: Use SSML tags to control emphasis, pauses, and pronunciation
File Management: Clean up temporary audio files after processing
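For the SSML tip, text should be XML-escaped before it is wrapped in tags, otherwise characters such as `&` make the markup invalid. The sketch below wraps sentences in `<s>` elements with a `<break>` pause between them; the function name `to_ssml` and the 400 ms default pause are choices made for this example.

```python
from xml.sax.saxutils import escape


def to_ssml(sentences, pause_ms: int = 400) -> str:
    """Wrap sentences in <speak>, escaping XML characters and pausing between sentences."""
    pause = f'<break time="{pause_ms}ms"/>'
    body = pause.join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return f"<speak>{body}</speak>"


print(to_ssml(["Terms & conditions apply.", "Thank you for calling."]))
# <speak><s>Terms &amp; conditions apply.</s><break time="400ms"/><s>Thank you for calling.</s></speak>
```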

Limitations and Quotas

Speech-to-Text

Maximum audio length: 480 minutes (8 hours)
Long-running recognition is required for audio longer than 1 minute
Synchronous recognition is limited to 1 minute of audio
Rate limits apply per project
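The duration limits above amount to a simple decision rule. The sketch below encodes it; the function name and the returned labels are hypothetical and do not correspond to node options.

```python
def choose_recognition_mode(duration_seconds: float) -> str:
    """Pick a recognition mode from the audio duration, using the limits listed above."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds > 480 * 60:
        raise ValueError("audio is longer than the 480-minute maximum")
    # Synchronous recognition handles up to 1 minute; anything longer is long-running.
    return "synchronous" if duration_seconds <= 60 else "long_running"
```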

Text-to-Speech

Maximum input text: 5,000 characters per request
Maximum SSML length: 5,000 characters
Rate limits apply per project
Some voices have usage restrictions
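Text longer than the per-request limit has to be split before synthesis. The sketch below splits at sentence boundaries where possible and hard-splits any single sentence that is itself too long. The function name is hypothetical, and the limit counts Python characters; if the service counts bytes, a lower limit may be safer for non-ASCII text.

```python
import re


def split_text(text: str, limit: int = 5000) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut into fixed-size pieces.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", limit=10))
# ['One. Two.', 'Three.']
```

Each chunk is then synthesized separately and the resulting audio files are joined in order.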

Common Errors and Solutions

Authentication Errors

Invalid credentials: Verify JSON key file is complete and valid
Permission denied: Ensure service account has required API roles
API not enabled: Enable Speech-to-Text/Text-to-Speech API in Cloud Console

Speech-to-Text Errors

Invalid GCS URI: Verify the gs:// path is correct and accessible
Unsupported audio format: Convert audio to supported format
Sample rate mismatch: Ensure configured rate matches actual audio
File not found: Verify the object exists in the bucket and the service account has the Storage Object Viewer role
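Malformed URIs can be caught before any request is sent. The sketch below splits a `gs://` URI into bucket and object name and rejects anything that lacks either part. The function name `parse_gcs_uri` is hypothetical, and the check is purely syntactic: it does not confirm that the object exists or is readable.

```python
def parse_gcs_uri(uri: str) -> tuple[str, str]:
    """Split gs://bucket/path/to/object into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a URI starting with {prefix!r}: {uri!r}")
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"URI must include both a bucket and an object name: {uri!r}")
    return bucket, object_name


# Hypothetical bucket and object name, for illustration only.
print(parse_gcs_uri("gs://example-bucket/recordings/2024/call-001.wav"))
# ('example-bucket', 'recordings/2024/call-001.wav')
```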

Text-to-Speech Errors

Invalid voice name: Check voice is available for selected language
Text too long: Split large texts into smaller chunks
Invalid language code: Use correct BCP-47 language codes
File write error: Verify output path is writable

Resources

📄️ Speech to Text

Robomotion.GoogleSpeech.SpeechToText

📄️ Text to Speech

Robomotion.GoogleSpeech.TextToSpeech
