Text To Speech

Converts text or SSML content to natural-sounding speech audio using the Google Cloud Text-to-Speech API, with support for multiple languages, voices, and audio formats.

Common Properties

Name - The custom name of the node.
Color - The custom color of the node.
Delay Before (sec) - Waits in seconds before executing the node.
Delay After (sec) - Waits in seconds after executing the node.
Continue On Error - Automation will continue regardless of any error. The default value is false.

Note: If the Continue On Error property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

Text - The text or SSML content to convert to speech. Supports both plain text and SSML (Speech Synthesis Markup Language) formats for enhanced control. Maximum length: 5,000 characters.
Path - The file path where the generated audio file will be saved. If empty, the audio will be saved to a temporary file with the appropriate extension. The path should be absolute and writable.

Output

Path - The absolute file path where the generated audio file was saved. Use this path to access, play, or further process the audio file in subsequent nodes.

Options

Audio Configuration

Audio Encoding - The output audio format for the generated speech. Default is WAV.
- Wav - Uncompressed LINEAR16 PCM audio. Highest quality, largest file size. Best for:
  - Professional audio production
  - Further audio processing
  - Maximum compatibility
- Mp3 - Compressed MP3 format. Good quality, smaller file size. Best for:
  - Web applications
  - Storage efficiency
  - Email attachments
- Ogg Opus - Compressed OGG Opus format. Excellent quality with efficient compression. Best for:
  - Web streaming
  - Real-time applications
  - Low bandwidth scenarios
Sample Rate - The sample rate (in Hertz) for the generated audio. Default is 16000 Hz. Higher sample rates provide better quality but larger files. Common values:
- 8000 - Telephone quality, smallest files
- 16000 - Wideband quality (recommended default)
- 22050 - FM radio quality
- 24000 - High quality
- 32000 - Very high quality
- 44100 - CD quality
- 48000 - Professional quality

Voice Configuration

Language Code - The BCP-47 language code for speech synthesis. Default is en-US. This determines the language and regional variant. Examples:
- en-US - English (United States)
- en-GB - English (United Kingdom)
- en-AU - English (Australia)
- es-ES - Spanish (Spain)
- es-US - Spanish (United States)
- fr-FR - French (France)
- de-DE - German (Germany)
- ja-JP - Japanese
- ko-KR - Korean
- zh-CN - Chinese (Simplified)
- tr-TR - Turkish
- pt-BR - Portuguese (Brazil)
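Voice Name - The specific voice to use for synthesis. Default is en-US-Studio-M. Voice names follow the pattern: {languageCode}-{voiceType}-{variant}. The final segment identifies a particular voice within the type; it does not reliably indicate gender (for example, the A in en-US-Neural2-A is only a variant letter). Examples: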
- en-US-Studio-M - Male Studio voice (US English)
- en-US-Studio-F - Female Studio voice (US English)
- en-US-Neural2-A - Neural voice A (US English)
- en-US-Wavenet-A - WaveNet voice A (US English)
- es-ES-Standard-A - Standard voice A (Spanish)
- fr-FR-Studio-D - Studio voice D (French)
Voice types:
- Studio - Highest quality, most natural (recommended)
- Neural2 - High quality neural voices
- Wavenet - Advanced neural voices
- Standard - Basic quality voices

Advanced Options

SSML Text - Enable if the input text is in SSML (Speech Synthesis Markup Language) format. Default is false. When enabled:
- Allows fine-grained control over pronunciation, emphasis, pauses, and pitch
- Supports audio effects and custom dictionaries
- Requires well-formed SSML XML structure
- See SSML Reference for details

Authentication

Credentials - Google Cloud service account credentials in JSON format. Requirements:
- Cloud Text-to-Speech API enabled in the service account's project
- Text-to-Speech Client role or equivalent permissions granted to the service account
- Active billing account associated with the project

How It Works

The Text to Speech node generates natural-sounding audio from text using Google Cloud's advanced neural networks. When executed:

1. Validates that the input text is not empty (max 5,000 characters)
2. Authenticates with the Google Cloud Text-to-Speech API using the provided credentials
3. Configures synthesis parameters:
- Selects the specified voice (language code + voice name)
- Sets audio encoding (WAV, MP3, or OGG Opus)
- Applies the specified sample rate
- Processes input as plain text or SSML
4. Sends the synthesis request to Google Cloud
5. Receives the generated audio content
6. Saves the audio to the specified path (or to a temporary file if no path is provided)
7. Returns the file path for further processing
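
The configuration step above can be sketched as a function that maps the node's options onto the JSON body of a Cloud Text-to-Speech `text:synthesize` request. This is an illustration only, not the node's actual implementation: the function name `buildSynthesisRequest`, its parameter names, and the option-to-enum mapping are assumptions. The field names (`input`, `voice`, `audioConfig`, `audioEncoding`, `sampleRateHertz`) follow the public REST API.

```javascript
// Assumed mapping from the node's Audio Encoding option to the
// AudioEncoding enum values of the Cloud Text-to-Speech REST API.
const ENCODING_ENUM = new Map([
  ["Wav", "LINEAR16"],
  ["Mp3", "MP3"],
  ["Ogg Opus", "OGG_OPUS"],
]);

// Builds the JSON body of a text:synthesize request from node-style options.
// Defaults mirror the defaults documented for this node.
function buildSynthesisRequest({
  text,
  isSsml = false,
  languageCode = "en-US",
  voiceName = "en-US-Studio-M",
  encoding = "Wav",
  sampleRate = 16000,
}) {
  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("Input text cannot be empty");
  }
  const audioEncoding = ENCODING_ENUM.get(encoding);
  if (audioEncoding === undefined) {
    throw new Error("invalid encoding");
  }
  return {
    // SSML and plain text are sent in different fields.
    input: isSsml ? { ssml: text } : { text },
    voice: { languageCode, name: voiceName },
    audioConfig: { audioEncoding, sampleRateHertz: sampleRate },
  };
}

const body = buildSynthesisRequest({
  text: "Welcome to our training course.",
  encoding: "Mp3",
  sampleRate: 24000,
});
console.log(JSON.stringify(body, null, 2));
```

The API response carries the audio as a base64 string, which is what a client would decode and write to the output path in the saving step.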

Requirements

Valid Google Cloud credentials with Text-to-Speech API enabled
Non-empty text content (max 5,000 characters)
Writable file path (or system temp directory access)
Internet connectivity to Google Cloud services

Error Handling

The node returns specific errors for common issues:

ErrInvalidArg:
- Empty or missing text input
- Invalid audio encoding format
- Invalid sample rate
- Invalid language code or voice name
- Text exceeds 5,000 character limit
- Synthesis operation failed
ErrCredentials:
- Invalid or missing credentials
- Malformed JSON credentials
- Missing credential content
- Insufficient API permissions
ErrFilePath:
- Cannot create temporary file
- Invalid temporary directory
ErrWriteFile:
- Cannot write audio to file
- Path is not writable
- Disk space issues
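
Several of the conditions listed above can be checked before the node runs, for example in a preceding script step. The sketch below is a hypothetical pre-flight check: the function `validateInputs`, its parameter names, and the way it counts characters (Unicode code points) are assumptions, and the node itself may validate differently. Only the error names and the 5,000 character limit come from this page.

```javascript
const MAX_CHARS = 5000;
const ENCODING_OPTIONS = ["Wav", "Mp3", "Ogg Opus"];

// Returns a list of { code, message } problems; an empty list means the
// inputs passed every check in this sketch.
function validateInputs({ text, encoding, sampleRate, credentialsJson }) {
  const problems = [];

  if (typeof text !== "string" || text.trim() === "") {
    problems.push({ code: "ErrInvalidArg", message: "Empty or missing text input" });
  } else if ([...text].length > MAX_CHARS) {
    // Spreading a string iterates by code point, not by UTF-16 unit.
    problems.push({ code: "ErrInvalidArg", message: "Text exceeds 5,000 character limit" });
  }

  if (!ENCODING_OPTIONS.includes(encoding)) {
    problems.push({ code: "ErrInvalidArg", message: "Invalid audio encoding format" });
  }

  if (!Number.isInteger(sampleRate) || sampleRate <= 0) {
    problems.push({ code: "ErrInvalidArg", message: "Invalid sample rate" });
  }

  if (!credentialsJson) {
    problems.push({ code: "ErrCredentials", message: "Invalid or missing credentials" });
  } else {
    try {
      const parsed = JSON.parse(credentialsJson);
      if (parsed === null || typeof parsed !== "object") {
        throw new Error("not a JSON object");
      }
    } catch {
      problems.push({ code: "ErrCredentials", message: "Malformed JSON credentials" });
    }
  }

  return problems;
}

console.log(validateInputs({
  text: "",
  encoding: "Flac",
  sampleRate: 0,
  credentialsJson: "{",
}));
```

Checks that need the API itself, such as whether a voice exists or whether the service account has permission, cannot be done locally and still surface at execution time.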

Common Google Cloud API errors:

Invalid voice name: Verify voice exists for specified language
Invalid language code: Use correct BCP-47 language codes
API not enabled: Enable Text-to-Speech API in Cloud Console
Permission denied: Ensure service account has TTS Client role
Quota exceeded: Check usage limits in Cloud Console

Example Use Cases

Generate Video Narration

Input:
  Text: "Welcome to our training course on automation with Robomotion."
  Path: /path/to/videos/narration.mp3

Options:
  Audio Encoding: Mp3
  Language Code: en-US
  Voice Name: en-US-Studio-M
  Sample Rate: 24000

Use Case: Create professional voiceovers for training videos and presentations.

Automated Phone System

Input:
  Text: "Thank you for calling. Your call is important to us. Please hold."
  Path: /path/to/ivr/hold-message.wav

Options:
  Audio Encoding: Wav
  Language Code: en-US
  Voice Name: en-US-Neural2-F
  Sample Rate: 8000

Use Case: Generate IVR messages and automated phone system prompts.

Multilingual Notifications

Input:
  Text: "Su pedido ha sido enviado y llegará en 2 días."
  Path: /notifications/shipping-es.mp3

Options:
  Audio Encoding: Mp3
  Language Code: es-ES
  Voice Name: es-ES-Studio-A
  Sample Rate: 16000

Use Case: Create audio notifications in multiple languages for global customers.

SSML Enhanced Speech

Input:
  Text: |
    <speak>
      Hello <break time="500ms"/> and welcome.
      <emphasis level="strong">This is important!</emphasis>
      <prosody rate="slow" pitch="+2st">
        Please listen carefully.
      </prosody>
    </speak>
  Path: /audio/announcement.wav

Options:
  Audio Encoding: Wav
  SSML Text: true
  Language Code: en-US
  Voice Name: en-US-Studio-F
  Sample Rate: 24000

Use Case: Create dynamic announcements with precise control over timing and emphasis.

Accessible Content

Input:
  Text: "This is the text content from the article to be read aloud."
  Path: /accessible/article-audio.mp3

Options:
  Audio Encoding: Mp3
  Language Code: en-US
  Voice Name: en-US-Studio-M
  Sample Rate: 22050

Use Case: Convert written content to audio for visually impaired users.

Email Notifications

Input:
  Text: "You have a new message from customer support. Please check your inbox."

Options:
  Audio Encoding: Mp3
  Language Code: en-GB
  Voice Name: en-GB-Neural2-A
  Sample Rate: 16000

Use Case: Generate audio notifications and send as email attachments.
Note: Path is empty, so audio is saved to a temporary file.

Usage Notes

Text Input Requirements

Maximum length: 5,000 characters
For longer texts, split into multiple chunks
Plain text and SSML are supported
SSML must be well-formed XML
Special characters in SSML content must be escaped as XML entities (for example, &amp; for &, &lt; for <, and &gt; for >)
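
One way to split longer texts is at sentence boundaries, so that each chunk stays under the limit and no sentence is cut in the middle unless it is itself too long. The sketch below is one possible approach, not something the node provides: the function name `splitIntoChunks` and its sentence-detection rule are assumptions. It applies to plain text only; SSML would need to be split so that every chunk remains well-formed.

```javascript
// Splits plain text into chunks of at most maxChars characters, preferring
// sentence boundaries (., !, or ? followed by optional whitespace).
// Note: String.length counts UTF-16 code units, which can exceed the number
// of visible characters for some scripts and emoji.
function splitIntoChunks(text, maxChars = 5000) {
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) || [];
  const chunks = [];
  let current = "";

  for (const sentence of sentences) {
    // Close the current chunk if this sentence would overflow it.
    if (current !== "" && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    // A single sentence longer than the limit is hard-split.
    let rest = sentence;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current += rest;
  }

  if (current.trim() !== "") {
    chunks.push(current.trim());
  }
  return chunks;
}

const parts = splitIntoChunks("One two three. Four five six. Seven.", 20);
console.log(parts);
```

Each chunk can then be sent through the node in turn, with the resulting audio files concatenated afterwards by a separate step.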

Voice Selection Guidelines

Studio voices: Best quality, most natural sounding (recommended)
Neural2 voices: High quality, good naturalness
Wavenet voices: Advanced quality, natural prosody
Standard voices: Basic quality, lower cost
Voice availability varies by language
Not all voices support all features
Test different voices to find the best fit
Use language-appropriate voices (don't use an en-US voice for Spanish text)

Audio Format Selection

WAV: Choose for maximum quality and compatibility
- No compression artifacts
- Best for further audio processing
- Larger file sizes
MP3: Choose for balanced quality and size
- Good compression efficiency
- Wide compatibility
- Smaller files than WAV
OGG Opus: Choose for streaming and web
- Best compression efficiency
- Excellent quality at low bitrates
- May have compatibility limitations

Sample Rate Guidelines

Higher sample rates = better quality but larger files
Match sample rate to your use case:
- 8000 Hz: Phone systems, IVR
- 16000 Hz: General use, voice assistants (recommended default)
- 22050-24000 Hz: High quality content, videos
- 44100-48000 Hz: Professional production, music
Some voices may have sample rate limitations
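
The size trade-off is easy to estimate for WAV output, because LINEAR16 stores each sample as 2 bytes. The sketch below assumes single-channel (mono) audio and a standard 44-byte WAV header; both are assumptions rather than statements about this node, and compressed formats (MP3, OGG Opus) do not follow this formula.

```javascript
// Estimated WAV size: header + sampleRate * bytesPerSample * seconds.
function estimateWavBytes(sampleRateHz, seconds) {
  const BYTES_PER_SAMPLE = 2; // LINEAR16 is 16-bit PCM
  const HEADER_BYTES = 44;    // canonical RIFF/WAVE header (assumed)
  return HEADER_BYTES + sampleRateHz * BYTES_PER_SAMPLE * seconds;
}

for (const rate of [8000, 16000, 24000, 48000]) {
  const kib = estimateWavBytes(rate, 60) / 1024;
  console.log(`${rate} Hz: about ${kib.toFixed(0)} KiB per minute of speech`);
}
```

Under these assumptions, doubling the sample rate doubles the file size, so one minute at 16000 Hz takes roughly 1.9 MB and one minute at 48000 Hz roughly 5.8 MB.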

SSML Features

Enable SSML to control:

Pauses: <break time="500ms"/>
Emphasis: <emphasis level="strong">important</emphasis>
Prosody: <prosody rate="slow" pitch="+2st">text</prosody>
Say As: <say-as interpret-as="cardinal">123</say-as>
Phonemes: <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
Audio: <audio src="https://example.com/sound.mp3">fallback</audio>

File Path Handling

If Path is empty, the file is saved to a temporary directory
Temporary files have unique names to avoid conflicts
File extension matches the audio encoding (.wav, .mp3, .ogg)
Ensure target directory exists and is writable
Use absolute paths for reliability
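
A mismatch between the Path extension and the Audio Encoding option (for example, a .wav path with Mp3 encoding) is easy to catch before running the node. The helper below is a hypothetical sketch: `checkOutputPath` and the extension table are illustrative names, and whether the node itself rejects or accepts a mismatched extension is not stated on this page.

```javascript
const path = require("node:path");

// Extensions listed on this page for each Audio Encoding option.
const EXTENSIONS = new Map([
  ["Wav", ".wav"],
  ["Mp3", ".mp3"],
  ["Ogg Opus", ".ogg"],
]);

// Returns { ok, reason } describing whether outputPath suits the encoding.
function checkOutputPath(outputPath, encoding) {
  const expected = EXTENSIONS.get(encoding);
  if (expected === undefined) {
    return { ok: false, reason: `unknown encoding: ${encoding}` };
  }
  if (outputPath === "") {
    // Empty Path means the node chooses a temporary file itself.
    return { ok: true, reason: "empty path, temporary file will be used" };
  }
  if (!path.isAbsolute(outputPath)) {
    return { ok: false, reason: "path is not absolute" };
  }
  const actual = path.extname(outputPath).toLowerCase();
  if (actual !== expected) {
    return { ok: false, reason: `extension ${actual || "(none)"} does not match ${expected}` };
  }
  return { ok: true, reason: "ok" };
}

console.log(checkOutputPath("/path/to/videos/narration.mp3", "Mp3"));
console.log(checkOutputPath("/audio/announcement.wav", "Mp3"));
```

This check does not confirm that the target directory exists or is writable; that still has to be verified on the machine where the flow runs.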

Cost Optimization

Text-to-Speech is billed per 1 million characters
Standard voices cost less than Neural/Wavenet/Studio voices
Cache generated audio files to reduce API calls
Reuse audio for repeated content
Monitor usage in Google Cloud Console
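
Caching can be as simple as keying each result by everything that affects the audio: the text, the voice, the encoding, and the sample rate. The sketch below keeps an in-memory map from such a key to the pending or finished result. All names here (`cacheKey`, `getOrSynthesize`, `fakeSynthesize`) are hypothetical, and `synthesize` stands for whatever step actually produces the file, such as a sub-flow that runs this node.

```javascript
const audioCache = new Map();

// Two requests share a cache entry only if every field that changes the
// audio is identical.
function cacheKey({ text, voiceName, encoding, sampleRate, isSsml = false }) {
  return JSON.stringify([text, voiceName, encoding, sampleRate, isSsml]);
}

// Returns a promise for the audio path, calling synthesize at most once per
// distinct request. Failed attempts are evicted so they can be retried.
function getOrSynthesize(request, synthesize) {
  const key = cacheKey(request);
  if (!audioCache.has(key)) {
    const pending = Promise.resolve(synthesize(request));
    pending.catch(() => audioCache.delete(key));
    audioCache.set(key, pending);
  }
  return audioCache.get(key);
}

// Stand-in for the real synthesis step; it returns a made-up path.
const fakeSynthesize = async (req) => `/tmp/${req.voiceName}.mp3`;
const request = {
  text: "Your order has shipped.",
  voiceName: "en-US-Studio-M",
  encoding: "Mp3",
  sampleRate: 16000,
};
getOrSynthesize(request, fakeSynthesize).then((p) => console.log("first call:", p));
getOrSynthesize(request, fakeSynthesize).then((p) => console.log("from cache:", p));
```

An in-memory map only lasts for one run. For reuse across runs, the same key could be hashed into a file name so that an existing file on disk acts as the cache.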

Language and Voice Matching

Always match voice to language code
Using mismatched voice/language produces errors
Each language has different available voices
Check Google Cloud voices for available options
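
Because the voice names shown on this page start with their language code, a mismatch can be detected locally before any API call. The sketch below parses names shaped like the examples here (such as en-US-Neural2-A) and compares the prefix with the Language Code option. The function names, the regular expression, and the label `variant` for the final segment are assumptions; voice names with a different shape are reported as unparseable rather than guessed at, and the check cannot tell whether a voice actually exists.

```javascript
// Parses names of the form <lang>-<REGION>-<type>-<variant>.
// Returns null for anything that does not fit that shape.
function parseVoiceName(voiceName) {
  const match = /^([a-z]{2,3}-[A-Z]{2})-([A-Za-z0-9]+)-([A-Za-z0-9]+)$/.exec(voiceName);
  if (match === null) {
    return null;
  }
  const [, languageCode, voiceType, variant] = match;
  return { languageCode, voiceType, variant };
}

// True only when the voice name parses and its prefix equals languageCode.
function voiceMatchesLanguage(voiceName, languageCode) {
  const parsed = parseVoiceName(voiceName);
  return parsed !== null &&
    parsed.languageCode.toLowerCase() === languageCode.toLowerCase();
}

console.log(parseVoiceName("en-US-Neural2-A"));
console.log(voiceMatchesLanguage("en-US-Studio-M", "es-ES"));
console.log(voiceMatchesLanguage("es-ES-Standard-A", "es-ES"));
```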

Output Usage

The Path output contains the file system path to the generated audio:

// Access the audio file path
const audioPath = $.path;

// Use in subsequent nodes
// - Upload to cloud storage
// - Send via email
// - Play in application
// - Convert to different format
// - Analyze audio properties

Example flow:

Text to Speech node generates audio → outputs path
Upload File node uploads audio to cloud storage
Send Email node attaches the audio file
Delete File node cleans up temporary file

Tips for Best Results

Voice Selection: Test multiple voices to find the most natural option
- Studio voices provide the best quality
- Match voice gender to content context
- Consider regional accents for target audience
Text Preparation: Format text for optimal speech
- Use proper punctuation for natural pauses
- Spell out numbers, acronyms, and abbreviations
- Break long sentences into shorter segments
- Remove formatting characters (markdown, HTML, etc.)
SSML Usage: Leverage SSML for precise control
- Add pauses for emphasis and clarity
- Control pronunciation of difficult words
- Adjust speaking rate for complex content
- Use emphasis for key points
Audio Format: Choose based on use case
- WAV for production and editing
- MP3 for web and distribution
- OGG Opus for streaming applications
Sample Rate: Balance quality and file size
- 16kHz is sufficient for most voice content
- Use higher rates only when quality is critical
- Match rate to your audio infrastructure
Error Handling: Implement robust error recovery
- Validate text length before processing
- Check credentials before batch operations
- Handle temporary file cleanup
- Retry on transient network errors
Batch Processing: For multiple texts
- Reuse credentials across nodes
- Process in parallel when possible
- Implement rate limiting if needed
- Cache results when appropriate
Quality Control: Verify output quality
- Listen to generated samples
- Check pronunciation of technical terms
- Adjust SSML for natural flow
- Test with target audience
File Management: Handle generated files properly
- Use meaningful file names
- Organize by language/voice if needed
- Clean up temporary files
- Implement backup strategies
Localization: For multilingual content
- Use appropriate language codes
- Select culturally appropriate voices
- Test with native speakers
- Consider regional preferences
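
The "retry on transient network errors" tip can be sketched as a small wrapper with exponential backoff. This is a generic pattern, not a feature of the node: `withRetry` and its options are hypothetical names, and deciding which failures count as transient is left to the caller through `isTransient`, since argument and credential errors (ErrInvalidArg, ErrCredentials) will not succeed on a second attempt.

```javascript
// Runs operation up to `attempts` times, waiting baseDelayMs, then twice
// that, and so on between tries. Non-transient errors are rethrown at once.
async function withRetry(operation, options = {}) {
  const { attempts = 3, baseDelayMs = 500, isTransient = () => true } = options;
  let lastError;

  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === attempts || !isTransient(error)) {
        break;
      }
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Stand-in operation that fails twice before succeeding.
let demoTries = 0;
withRetry(async () => {
  demoTries += 1;
  if (demoTries < 3) throw new Error("simulated network timeout");
  return "/tmp/audio.mp3";
}, { baseDelayMs: 10 }).then((result) => console.log("succeeded with", result));
```

With the defaults, a call waits 500 ms after the first failure and 1000 ms after the second, then gives up and rethrows the last error.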

Common Errors and Solutions

"Input text cannot be empty"

Cause: No text provided in the Text input
Solution: Provide valid text content (1-5,000 characters)

"No Credential Content"

Cause: Credentials not properly configured
Solution: Verify service account JSON is complete and valid

"invalid encoding"

Cause: Audio encoding option is invalid
Solution: Use one of: wav, mp3, ogg

"Failed to synthesize speech"

Cause: Various API or configuration issues
Solution:
- Verify language code is valid (BCP-47 format)
- Check voice name exists for specified language
- Ensure text is within character limit
- Verify SSML is well-formed (if used)
- Check API is enabled in Cloud Console

"Invalid input path"

Cause: Path parameter is invalid
Solution: Provide valid writable path or leave empty for temp file

"Failed to write audio to file"

Cause: Cannot write to specified path
Solution:
- Verify directory exists
- Check write permissions
- Ensure sufficient disk space
- Try using temp file (empty Path)

Poor Audio Quality

Cause: Suboptimal settings
Solution:
- Use Studio or Neural2 voices
- Increase sample rate (24000 Hz+)
- Use WAV format for uncompressed audio
- Check text formatting

Unnatural Pronunciation

Cause: Text contains difficult words or formatting
Solution:
- Enable SSML and use phoneme tags
- Spell out abbreviations and acronyms
- Add appropriate punctuation
- Use say-as tags for numbers and dates

Voice Not Available Error

Cause: Selected voice doesn't exist for language
Solution: Check available voices for your language

SSML Parsing Error

Cause: Malformed SSML content
Solution:
- Validate SSML XML structure
- Properly escape special characters
- Use valid SSML tags only
- Test with simple SSML first
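
A common source of parsing errors is user-supplied text that contains characters such as & or < being placed inside SSML unchanged. The sketch below escapes the five XML special characters before wrapping text in a speak element. The helper names `escapeSsml` and `toSsml` are illustrative and not part of the node.

```javascript
// Replaces the XML special characters with their entities.
// The ampersand must be handled first so later entities are not re-escaped.
function escapeSsml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Wraps escaped text in <speak>, optionally preceded by a pause.
function toSsml(text, pauseMs = 0) {
  const pause = pauseMs > 0 ? `<break time="${pauseMs}ms"/>` : "";
  return `<speak>${pause}${escapeSsml(text)}</speak>`;
}

console.log(toSsml("Terms & conditions apply when price < $100", 500));
```

Only the dynamic text should be escaped. Escaping a string that already contains SSML tags would turn those tags into literal text.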

SSML Examples

Basic Pauses and Emphasis

<speak>
  Welcome to our service.
  <break time="1s"/>
  <emphasis level="strong">This is important.</emphasis>
  Please <break time="500ms"/> continue.
</speak>

Pronunciation Control

<speak>
  The year is <say-as interpret-as="date" format="y">2024</say-as>.
  Call us at <say-as interpret-as="telephone">555-123-4567</say-as>.
  Price: <say-as interpret-as="currency">$99.99</say-as>
</speak>

Speaking Rate and Pitch

<speak>
  Normal speed text here.
  <prosody rate="slow">This part is spoken slowly.</prosody>
  <prosody rate="fast" pitch="+2st">This is fast and higher pitch.</prosody>
</speak>

Alternate Pronunciations (using phonemes)

<speak>
  How to say tomato:
  <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme> or
  <phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme>
</speak>

Related Resources

Google Cloud Text-to-Speech Documentation
Available Voices
SSML Reference
Supported Languages
Audio Profiles
Pricing Information
BCP-47 Language Codes
Voice Samples - Listen to voice examples
