Text To Speech
Converts text or SSML content to natural-sounding speech audio using the Google Cloud Text-to-Speech API, with support for multiple languages, voices, and audio formats.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- Text - The text or SSML content to convert to speech. Supports both plain text and SSML (Speech Synthesis Markup Language) formats for enhanced control. Maximum length: 5,000 characters.
- Path - The file path where the generated audio file will be saved. If empty, the audio will be saved to a temporary file with the appropriate extension. The path should be absolute and writable.
Output
- Path - The absolute file path where the generated audio file was saved. Use this path to access, play, or further process the audio file in subsequent nodes.
Options
Audio Configuration
- Audio Encoding - The output audio format for the generated speech. Default is WAV.
  - Wav - Uncompressed LINEAR16 PCM audio. Highest quality, largest file size. Best for:
    - Professional audio production
    - Further audio processing
    - Maximum compatibility
  - Mp3 - Compressed MP3 format. Good quality, smaller file size. Best for:
    - Web applications
    - Storage efficiency
    - Email attachments
  - Ogg Opus - Compressed OGG Opus format. Excellent quality with efficient compression. Best for:
    - Web streaming
    - Real-time applications
    - Low bandwidth scenarios
- Sample Rate - The sample rate (in Hertz) for the generated audio. Default is 16000 Hz. Higher sample rates provide better quality but larger files. Common values:
  - 8000 - Telephone quality, smallest files
  - 16000 - Wideband quality (recommended default)
  - 22050 - FM radio quality
  - 24000 - High quality
  - 32000 - Very high quality
  - 44100 - CD quality
  - 48000 - Professional quality
Voice Configuration
- Language Code - The BCP-47 language code for speech synthesis. Default is en-US. This determines the language and regional variant. Examples:
  - en-US - English (United States)
  - en-GB - English (United Kingdom)
  - en-AU - English (Australia)
  - es-ES - Spanish (Spain)
  - es-US - Spanish (United States)
  - fr-FR - French (France)
  - de-DE - German (Germany)
  - ja-JP - Japanese
  - ko-KR - Korean
  - zh-CN - Chinese (Simplified)
  - tr-TR - Turkish
  - pt-BR - Portuguese (Brazil)
- Voice Name - The specific voice to use for synthesis. Default is en-US-Studio-M. Voice names follow the pattern {languageCode}-{voiceType}-{voiceId}, where the last segment identifies the individual voice (a letter such as A, D, F, or M). Examples:
  - en-US-Studio-M - Male Studio voice (US English)
  - en-US-Studio-F - Female Studio voice (US English)
  - en-US-Neural2-A - Neural voice A (US English)
  - en-US-Wavenet-A - WaveNet voice A (US English)
  - es-ES-Standard-A - Standard voice A (Spanish)
  - fr-FR-Studio-D - Studio voice D (French)
Voice types:
- Studio - Highest quality, most natural (recommended)
- Neural2 - High quality neural voices
- Wavenet - Advanced neural voices
- Standard - Basic quality voices
Advanced Options
- SSML Text - Enable if the input text is in SSML (Speech Synthesis Markup Language) format. Default is false. When enabled:
  - Allows fine-grained control over pronunciation, emphasis, pauses, and pitch
  - Supports audio effects and custom dictionaries
  - Requires well-formed SSML XML structure
  - See SSML Reference for details
Authentication
- Credentials - Google Cloud service account credentials in JSON format. The service account must have:
  - Cloud Text-to-Speech API enabled
  - Text-to-Speech Client role or equivalent permissions
  - Active billing account associated with the project
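Before running a flow, it can help to confirm that the credential content at least looks like a service account key. The sketch below is a hypothetical pre-check, not part of the node: the function name `checkServiceAccountJson` and the returned reason strings are assumptions made for illustration. It only checks for fields that a Google Cloud service account key file normally contains and cannot tell whether the API is enabled or billing is active.

```javascript
// Hypothetical pre-check (not part of the node): verifies that a credentials
// string parses as JSON and contains the usual service account key fields.
const REQUIRED_FIELDS = ["type", "project_id", "private_key", "client_email"];

function checkServiceAccountJson(raw) {
  if (typeof raw !== "string" || raw.trim() === "") {
    return { ok: false, reason: "missing credential content" };
  }
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return { ok: false, reason: "malformed JSON" };
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { ok: false, reason: "malformed JSON" };
  }
  // Every required field must be present as a non-empty string.
  const missing = REQUIRED_FIELDS.filter(
    (field) => typeof parsed[field] !== "string" || parsed[field] === ""
  );
  if (missing.length > 0) {
    return { ok: false, reason: "missing fields: " + missing.join(", ") };
  }
  if (parsed.type !== "service_account") {
    return { ok: false, reason: "type is not service_account" };
  }
  return { ok: true, reason: "" };
}
```

A passing check only means the JSON is structurally plausible; permission and billing problems still surface when the node calls the API.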
How It Works
The Text to Speech node generates natural-sounding audio from text using Google Cloud's advanced neural networks. When executed:
- Validates that the input text is not empty and does not exceed 5,000 characters
- Authenticates with the Google Cloud Text-to-Speech API using the provided credentials
- Configures synthesis parameters:
  - Selects the specified voice (language code + voice name)
  - Sets audio encoding (WAV, MP3, or OGG Opus)
  - Applies the specified sample rate
  - Processes input as plain text or SSML
- Sends the synthesis request to Google Cloud
- Receives the generated audio content
- Saves the audio to the specified path (or to a temporary file if no path is provided)
- Returns the file path for further processing
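The first step above, input validation, can be mirrored in a preceding Function node so that bad input is caught before any API call is made. This is a minimal sketch under stated assumptions: `validateTtsText` is a hypothetical helper, whitespace-only text is treated as empty, and length is measured with JavaScript's `String.length` (UTF-16 code units), which may not match exactly how the node or the API counts characters.

```javascript
// Hypothetical pre-validation helper; the 5,000 limit comes from this page.
const MAX_TEXT_LENGTH = 5000;

function validateTtsText(text) {
  // Assumption: whitespace-only input is treated the same as empty input.
  if (typeof text !== "string" || text.trim().length === 0) {
    return { valid: false, error: "Input text cannot be empty" };
  }
  // Assumption: length is counted in UTF-16 code units (String.length).
  if (text.length > MAX_TEXT_LENGTH) {
    return {
      valid: false,
      error: "Text is " + text.length + " characters; limit is " + MAX_TEXT_LENGTH,
    };
  }
  return { valid: true, error: "" };
}
```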
Requirements
- Valid Google Cloud credentials with Text-to-Speech API enabled
- Non-empty text content (max 5,000 characters)
- Writable file path (or system temp directory access)
- Internet connectivity to Google Cloud services
Error Handling
The node returns specific errors for common issues:
- ErrInvalidArg:
  - Empty or missing text input
  - Invalid audio encoding format
  - Invalid sample rate
  - Invalid language code or voice name
  - Text exceeds 5,000 character limit
  - Synthesis operation failed
- ErrCredentials:
  - Invalid or missing credentials
  - Malformed JSON credentials
  - Missing credential content
  - Insufficient API permissions
- ErrFilePath:
  - Cannot create temporary file
  - Invalid temporary directory
- ErrWriteFile:
  - Cannot write audio to file
  - Path is not writable
  - Disk space issues
Common Google Cloud API errors:
- Invalid voice name: Verify voice exists for specified language
- Invalid language code: Use correct BCP-47 language codes
- API not enabled: Enable Text-to-Speech API in Cloud Console
- Permission denied: Ensure service account has TTS Client role
- Quota exceeded: Check usage limits in Cloud Console
Example Use Cases
Generate Video Narration
Input:
Text: "Welcome to our training course on automation with Robomotion."
Path: /path/to/videos/narration.mp3
Options:
Audio Encoding: Mp3
Language Code: en-US
Voice Name: en-US-Studio-M
Sample Rate: 24000
Use Case: Create professional voiceovers for training videos and presentations.
Automated Phone System
Input:
Text: "Thank you for calling. Your call is important to us. Please hold."
Path: /path/to/ivr/hold-message.wav
Options:
Audio Encoding: Wav
Language Code: en-US
Voice Name: en-US-Neural2-F
Sample Rate: 8000
Use Case: Generate IVR messages and automated phone system prompts.
Multilingual Notifications
Input:
Text: "Su pedido ha sido enviado y llegará en 2 días."
Path: /notifications/shipping-es.mp3
Options:
Audio Encoding: Mp3
Language Code: es-ES
Voice Name: es-ES-Studio-A
Sample Rate: 16000
Use Case: Create audio notifications in multiple languages for global customers.
SSML Enhanced Speech
Input:
Text: |
<speak>
Hello <break time="500ms"/> and welcome.
<emphasis level="strong">This is important!</emphasis>
<prosody rate="slow" pitch="+2st">
Please listen carefully.
</prosody>
</speak>
Path: /audio/announcement.wav
Options:
Audio Encoding: Wav
SSML Text: true
Language Code: en-US
Voice Name: en-US-Studio-F
Sample Rate: 24000
Use Case: Create dynamic announcements with precise control over timing and emphasis.
Accessible Content
Input:
Text: "This is the text content from the article to be read aloud."
Path: /accessible/article-audio.mp3
Options:
Audio Encoding: Mp3
Language Code: en-US
Voice Name: en-US-Studio-M
Sample Rate: 22050
Use Case: Convert written content to audio for visually impaired users.
Email Notifications
Input:
Text: "You have a new message from customer support. Please check your inbox."
Options:
Audio Encoding: Mp3
Language Code: en-GB
Voice Name: en-GB-Neural2-A
Sample Rate: 16000
Use Case: Generate audio notifications and send as email attachments.
Note: Path is empty, so audio is saved to a temporary file.
Usage Notes
Text Input Requirements
- Maximum length: 5,000 characters
- For longer texts, split into multiple chunks
- Plain text and SSML are supported
- SSML must be well-formed XML
- Special characters should be properly escaped in SSML
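For texts longer than the limit, one possible way to split them is at sentence boundaries, so that each chunk can be sent to a separate Text to Speech node execution. The sketch below is an assumption-laden illustration: `splitIntoChunks` is a hypothetical helper, it is intended for plain text only (splitting SSML this way could break the markup), and a single sentence longer than the limit is cut at a fixed offset.

```javascript
// Hypothetical helper: splits plain text into chunks of at most maxLen
// characters, preferring to break after sentence-ending punctuation.
function splitIntoChunks(text, maxLen = 5000) {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    let rest = sentence;
    // A sentence longer than maxLen is cut at fixed offsets.
    while (rest.length > maxLen) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxLen));
      rest = rest.slice(maxLen);
    }
    if (rest.length === 0) continue;
    const candidate = current ? current + " " + rest : rest;
    if (candidate.length > maxLen) {
      // current is non-empty here because rest alone fits within maxLen.
      chunks.push(current);
      current = rest;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

The resulting audio files would then need to be concatenated or played in sequence by later nodes.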
Voice Selection Guidelines
- Studio voices: Best quality, most natural sounding (recommended)
- Neural2 voices: High quality, good naturalness
- Wavenet voices: Advanced quality, natural prosody
- Standard voices: Basic quality, lower cost
- Voice availability varies by language
- Not all voices support all features
- Test different voices to find the best fit
- Use language-appropriate voices (don't use en-US voice for Spanish text)
Audio Format Selection
- WAV: Choose for maximum quality and compatibility
  - No compression artifacts
  - Best for further audio processing
  - Larger file sizes
- MP3: Choose for balanced quality and size
  - Good compression efficiency
  - Wide compatibility
  - Smaller files than WAV
- OGG Opus: Choose for streaming and web
  - Best compression efficiency
  - Excellent quality at low bitrates
  - May have compatibility limitations
Sample Rate Guidelines
- Higher sample rates give better quality but produce larger files
- Match the sample rate to your use case:
  - 8000 Hz: Phone systems, IVR
  - 16000 Hz: General use, voice assistants (recommended default)
  - 22050-24000 Hz: High quality content, videos
  - 44100-48000 Hz: Professional production, music
- Some voices may have sample rate limitations
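The guidelines above can be captured as a small lookup table in a Function node that sets the Sample Rate option. The use-case labels (`telephony`, `general`, `video`, `production`) and the helper name `sampleRateFor` are assumptions made for this sketch. Where the guidelines give a range, the sketch picks one value from it.

```javascript
// Hypothetical lookup: use-case labels are illustrative, rates come from
// the guidelines on this page (one value chosen where a range is given).
const SAMPLE_RATE_BY_USE_CASE = new Map([
  ["telephony", 8000],
  ["general", 16000],
  ["video", 24000],
  ["production", 48000],
]);

// Falls back to the documented default of 16000 Hz for unknown labels.
function sampleRateFor(useCase) {
  return SAMPLE_RATE_BY_USE_CASE.has(useCase)
    ? SAMPLE_RATE_BY_USE_CASE.get(useCase)
    : 16000;
}
```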
SSML Features
Enable SSML to control:
- Pauses: <break time="500ms"/>
- Emphasis: <emphasis level="strong">important</emphasis>
- Prosody: <prosody rate="slow" pitch="+2st">text</prosody>
- Say As: <say-as interpret-as="cardinal">123</say-as>
- Phonemes: <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
- Audio: <audio src="https://example.com/sound.mp3">fallback</audio>
File Path Handling
- If Path is empty, the file is saved to a temporary directory
- Temporary files have unique names to avoid conflicts
- File extension matches the audio encoding (.wav, .mp3, .ogg)
- Ensure target directory exists and is writable
- Use absolute paths for reliability
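When building the Path input dynamically, it may be useful to make sure the file extension agrees with the chosen encoding. The following is a sketch only: `withAudioExtension` is a hypothetical helper, and the encoding keys (`wav`, `mp3`, `ogg`) are assumed from the values listed under the "invalid encoding" error on this page.

```javascript
// Hypothetical helper: returns basePath with an extension that matches the
// audio encoding, replacing any extension on the final path segment.
const EXTENSION_BY_ENCODING = new Map([
  ["wav", ".wav"],
  ["mp3", ".mp3"],
  ["ogg", ".ogg"],
]);

function withAudioExtension(basePath, encoding) {
  const ext = EXTENSION_BY_ENCODING.get(String(encoding).toLowerCase());
  if (!ext) {
    throw new Error("invalid encoding: " + encoding);
  }
  // Only treat a dot as an extension separator if it is inside the last
  // path segment and is not that segment's first character.
  const lastSep = Math.max(basePath.lastIndexOf("/"), basePath.lastIndexOf("\\"));
  const lastDot = basePath.lastIndexOf(".");
  const stem = lastDot > lastSep + 1 ? basePath.slice(0, lastDot) : basePath;
  return stem + ext;
}
```

For example, `withAudioExtension("/audio/hello.txt", "mp3")` returns `/audio/hello.mp3`.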
Cost Optimization
- Text-to-Speech is billed per 1 million characters
- Standard voices cost less than Neural/Wavenet/Studio voices
- Cache generated audio files to reduce API calls
- Reuse audio for repeated content
- Monitor usage in Google Cloud Console
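Caching can be sketched as a map from the full set of synthesis parameters to the path of the audio already generated for them. Everything named here is hypothetical: `cacheKey`, `AudioCache`, and the caller-supplied `synthesize` function (which stands in for whatever runs the Text to Speech node and returns its Path output). The cache is in-memory only and does not check that a cached file still exists.

```javascript
// Hypothetical cache: identical requests reuse the previously generated path.
function cacheKey({ text, languageCode, voiceName, encoding, sampleRate, ssml }) {
  // Every parameter that changes the audio must be part of the key.
  return JSON.stringify([text, languageCode, voiceName, encoding, sampleRate, Boolean(ssml)]);
}

class AudioCache {
  constructor() {
    this.paths = new Map();
  }

  // synthesize is supplied by the caller and returns the audio file path.
  getOrCreate(request, synthesize) {
    const key = cacheKey(request);
    if (!this.paths.has(key)) {
      this.paths.set(key, synthesize(request));
    }
    return this.paths.get(key);
  }
}
```

If the cached paths point at temporary files, the files should not be deleted while the cache is still in use.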
Language and Voice Matching
- Always match voice to language code
- A voice that does not match the language code produces an error
- Each language has different available voices
- Check Google Cloud voices for available options
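A mismatch between voice and language code can be caught before the node runs by comparing the language prefix of the voice name with the Language Code option. The sketch below assumes voice names shaped like the examples on this page (for instance `en-US-Studio-M`); `parseVoiceName` and `voiceMatchesLanguage` are hypothetical helpers, and voice names with a different shape are simply reported as unparseable rather than validated against Google's voice list.

```javascript
// Hypothetical helpers, assuming names shaped like "en-US-Studio-M".
function parseVoiceName(voiceName) {
  const match = /^([a-z]{2,3}-[A-Z]{2})-([A-Za-z0-9]+)-([A-Za-z0-9]+)$/.exec(voiceName);
  if (!match) return null;
  return { languageCode: match[1], voiceType: match[2], voiceId: match[3] };
}

// True only when the voice name parses and its prefix equals languageCode.
function voiceMatchesLanguage(voiceName, languageCode) {
  const parsed = parseVoiceName(voiceName);
  return (
    parsed !== null &&
    parsed.languageCode.toLowerCase() === String(languageCode).toLowerCase()
  );
}
```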
Output Usage
The Path output contains the file system path to the generated audio:
// Access the audio file path
const audioPath = $.path;
// Use in subsequent nodes
// - Upload to cloud storage
// - Send via email
// - Play in application
// - Convert to different format
// - Analyze audio properties
Example flow:
- Text to Speech node generates audio → outputs path
- Upload File node uploads audio to cloud storage
- Send Email node attaches the audio file
- Delete File node cleans up temporary file
Tips for Best Results
- Voice Selection: Test multiple voices to find the most natural option
  - Studio voices provide the best quality
  - Match voice gender to content context
  - Consider regional accents for target audience
- Text Preparation: Format text for optimal speech
  - Use proper punctuation for natural pauses
  - Spell out numbers, acronyms, and abbreviations
  - Break long sentences into shorter segments
  - Remove formatting characters (markdown, HTML, etc.)
- SSML Usage: Leverage SSML for precise control
  - Add pauses for emphasis and clarity
  - Control pronunciation of difficult words
  - Adjust speaking rate for complex content
  - Use emphasis for key points
- Audio Format: Choose based on use case
  - WAV for production and editing
  - MP3 for web and distribution
  - OGG Opus for streaming applications
- Sample Rate: Balance quality and file size
  - 16kHz is sufficient for most voice content
  - Use higher rates only when quality is critical
  - Match rate to your audio infrastructure
- Error Handling: Implement robust error recovery
  - Validate text length before processing
  - Check credentials before batch operations
  - Handle temporary file cleanup
  - Retry on transient network errors
- Batch Processing: For multiple texts
  - Reuse credentials across nodes
  - Process in parallel when possible
  - Implement rate limiting if needed
  - Cache results when appropriate
- Quality Control: Verify output quality
  - Listen to generated samples
  - Check pronunciation of technical terms
  - Adjust SSML for natural flow
  - Test with target audience
- File Management: Handle generated files properly
  - Use meaningful file names
  - Organize by language/voice if needed
  - Clean up temporary files
  - Implement backup strategies
- Localization: For multilingual content
  - Use appropriate language codes
  - Select culturally appropriate voices
  - Test with native speakers
  - Consider regional preferences
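The "retry on transient network errors" tip can be sketched as a generic retry wrapper with exponential backoff. This is an illustration under stated assumptions: `backoffDelays` and `withRetry` are hypothetical helpers, `operation` stands in for whatever triggers the synthesis, and `isTransient` is supplied by the caller because this page does not specify how network failures are distinguished from permanent errors such as ErrCredentials.

```javascript
// Hypothetical helper: delays (ms) to wait between attempts, doubling each
// time and capped at capMs. One fewer delay than attempts.
function backoffDelays(maxAttempts, baseMs = 500, capMs = 8000) {
  const delays = [];
  for (let i = 0; i < maxAttempts - 1; i++) {
    delays.push(Math.min(capMs, baseMs * 2 ** i));
  }
  return delays;
}

// Runs operation, retrying only while isTransient(err) is true and
// attempts remain; otherwise the last error is rethrown.
async function withRetry(operation, isTransient, maxAttempts = 3, baseMs = 500) {
  const delays = backoffDelays(maxAttempts, baseMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= delays.length || !isTransient(err)) {
        throw err;
      }
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
}
```

Errors that will not succeed on a retry, such as invalid credentials or an invalid voice name, should make `isTransient` return false so they fail immediately.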
Common Errors and Solutions
"Input text cannot be empty"
- Cause: No text provided in the Text input
- Solution: Provide valid text content (1-5,000 characters)
"No Credential Content"
- Cause: Credentials not properly configured
- Solution: Verify service account JSON is complete and valid
"invalid encoding"
- Cause: Audio encoding option is invalid
- Solution: Use one of: wav, mp3, ogg
"Failed to synthesize speech"
- Cause: Various API or configuration issues
- Solution:
  - Verify language code is valid (BCP-47 format)
  - Check voice name exists for specified language
  - Ensure text is within character limit
  - Verify SSML is well-formed (if used)
  - Check API is enabled in Cloud Console
"Invalid input path"
- Cause: Path parameter is invalid
- Solution: Provide valid writable path or leave empty for temp file
"Failed to write audio to file"
- Cause: Cannot write to specified path
- Solution:
  - Verify directory exists
  - Check write permissions
  - Ensure sufficient disk space
  - Try using temp file (empty Path)
Poor Audio Quality
- Cause: Suboptimal settings
- Solution:
  - Use Studio or Neural2 voices
  - Increase sample rate (24000 Hz+)
  - Use WAV format for uncompressed audio
  - Check text formatting
Unnatural Pronunciation
- Cause: Text contains difficult words or formatting
- Solution:
  - Enable SSML and use phoneme tags
  - Spell out abbreviations and acronyms
  - Add appropriate punctuation
  - Use say-as tags for numbers and dates
Voice Not Available Error
- Cause: Selected voice doesn't exist for language
- Solution: Check available voices for your language
SSML Parsing Error
- Cause: Malformed SSML content
- Solution:
  - Validate SSML XML structure
  - Properly escape special characters
  - Use valid SSML tags only
  - Test with simple SSML first
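Escaping special characters is the fix most easily automated. Because SSML is XML, user-supplied text should have the five XML special characters replaced with entities before it is placed inside SSML tags. The helper names `escapeForSsml` and `wrapInSpeak` below are hypothetical and shown only as a sketch.

```javascript
// Hypothetical helper: escapes the XML special characters so that plain
// text can be embedded safely inside SSML. The ampersand is replaced first
// so that entities produced by later replacements are not escaped twice.
function escapeForSsml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Wraps escaped plain text in a minimal <speak> document.
function wrapInSpeak(text) {
  return "<speak>" + escapeForSsml(text) + "</speak>";
}
```

Only the text content should be escaped this way; SSML tags that are meant to be interpreted must be added after escaping.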
SSML Examples
Basic Pauses and Emphasis
<speak>
Welcome to our service.
<break time="1s"/>
<emphasis level="strong">This is important.</emphasis>
Please <break time="500ms"/> continue.
</speak>
Pronunciation Control
<speak>
The year is <say-as interpret-as="date" format="y">2024</say-as>.
Call us at <say-as interpret-as="telephone">555-123-4567</say-as>.
Price: <say-as interpret-as="currency">$99.99</say-as>
</speak>
Speaking Rate and Pitch
<speak>
Normal speed text here.
<prosody rate="slow">This part is spoken slowly.</prosody>
<prosody rate="fast" pitch="+2st">This is fast and higher pitch.</prosody>
</speak>
Pronunciation Variants (using phonemes)
<speak>
How to say tomato:
<phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme> or
<phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme>
</speak>