
Text To Speech

Converts text or SSML content to natural-sounding speech audio using the Google Cloud Text-to-Speech API, with support for multiple languages, voices, and audio formats.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
Note: If the Continue On Error property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • Text - The text or SSML content to convert to speech. Supports both plain text and SSML (Speech Synthesis Markup Language) formats for enhanced control. Maximum length: 5,000 characters.

  • Path - The file path where the generated audio file will be saved. If empty, the audio will be saved to a temporary file with the appropriate extension. The path should be absolute and writable.

Output

  • Path - The absolute file path where the generated audio file was saved. Use this path to access, play, or further process the audio file in subsequent nodes.

Options

Audio Configuration

  • Audio Encoding - The output audio format for the generated speech. Default is WAV.

    • Wav - Uncompressed LINEAR16 PCM audio. Highest quality, largest file size. Best for:
      • Professional audio production
      • Further audio processing
      • Maximum compatibility
    • Mp3 - Compressed MP3 format. Good quality, smaller file size. Best for:
      • Web applications
      • Storage efficiency
      • Email attachments
    • Ogg Opus - Compressed OGG Opus format. Excellent quality with efficient compression. Best for:
      • Web streaming
      • Real-time applications
      • Low bandwidth scenarios
  • Sample Rate - The sample rate (in Hertz) for the generated audio. Default is 16000 Hz. Higher sample rates provide better quality but larger files. Common values:

    • 8000 - Telephone quality, smallest files
    • 16000 - Wideband quality (recommended default)
    • 22050 - FM radio quality
    • 24000 - High quality
    • 32000 - Very high quality
    • 44100 - CD quality
    • 48000 - Professional quality

Voice Configuration

  • Language Code - The BCP-47 language code for speech synthesis. Default is en-US. This determines the language and regional variant. Examples:

    • en-US - English (United States)
    • en-GB - English (United Kingdom)
    • en-AU - English (Australia)
    • es-ES - Spanish (Spain)
    • es-US - Spanish (United States)
    • fr-FR - French (France)
    • de-DE - German (Germany)
    • ja-JP - Japanese
    • ko-KR - Korean
    • zh-CN - Chinese (Simplified)
    • tr-TR - Turkish
    • pt-BR - Portuguese (Brazil)
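  • Voice Name - The specific voice to use for synthesis. Default is en-US-Studio-M. Voice names generally follow the pattern {languageCode}-{voiceType}-{variant}. The variant is a short identifier that indicates gender in some names (M, F) and is simply a distinguishing letter in others (A, D). Examples: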

    • en-US-Studio-M - Male Studio voice (US English)
    • en-US-Studio-F - Female Studio voice (US English)
    • en-US-Neural2-A - Neural voice A (US English)
    • en-US-Wavenet-A - WaveNet voice A (US English)
    • es-ES-Standard-A - Standard voice A (Spanish)
    • fr-FR-Studio-D - Studio voice D (French)

    Voice types:

    • Studio - Highest quality, most natural (recommended)
    • Neural2 - High quality neural voices
    • Wavenet - Advanced neural voices
    • Standard - Basic quality voices

Advanced Options

  • SSML Text - Enable if the input text is in SSML (Speech Synthesis Markup Language) format. Default is false. When enabled:
    • Allows fine-grained control over pronunciation, emphasis, pauses, and pitch
    • Supports audio effects and custom dictionaries
    • Requires well-formed SSML XML structure
    • See SSML Reference for details

Authentication

  • Credentials - Google Cloud service account credentials in JSON format. The service account and its Google Cloud project must have:
    • Cloud Text-to-Speech API enabled
    • Text-to-Speech Client role or equivalent permissions
    • Active billing account associated with the project

How It Works

The Text to Speech node generates natural-sounding audio from text using Google Cloud's advanced neural networks. When executed:

  1. Validates that the input text is not empty and does not exceed 5,000 characters
  2. Authenticates with the Google Cloud Text-to-Speech API using the provided credentials
  3. Configures synthesis parameters:
    • Selects the specified voice (language code + voice name)
    • Sets audio encoding (WAV, MP3, or OGG Opus)
    • Applies the specified sample rate
    • Processes input as plain text or SSML
  4. Sends synthesis request to Google Cloud
  5. Receives generated audio content
  6. Saves audio to the specified path (or temporary file if no path provided)
  7. Returns the file path for further processing
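
The node's internal implementation is not documented here, but steps 1 and 3 above roughly correspond to building a request body for the Google Cloud Text-to-Speech REST method text:synthesize. The sketch below is an illustration only: buildSynthesizeRequest and the ENCODINGS map are hypothetical names, and the mapping from the node's option labels to API enum values is an assumption.

```javascript
// Assumed mapping from the node's Audio Encoding labels to the
// audioEncoding enum values of the text:synthesize REST method.
const ENCODINGS = new Map([
  ["Wav", "LINEAR16"],
  ["Mp3", "MP3"],
  ["Ogg Opus", "OGG_OPUS"],
]);

// Hypothetical helper: validates the inputs and builds the JSON request body.
function buildSynthesizeRequest({
  text,
  isSsml = false,
  languageCode = "en-US",
  voiceName = "en-US-Studio-M",
  encoding = "Wav",
  sampleRate = 16000,
}) {
  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("Input text cannot be empty");
  }
  if (text.length > 5000) {
    throw new Error("Text exceeds 5,000 character limit");
  }
  const audioEncoding = ENCODINGS.get(encoding);
  if (!audioEncoding) {
    throw new Error("invalid encoding");
  }
  return {
    // Plain text and SSML use different input fields.
    input: isSsml ? { ssml: text } : { text },
    voice: { languageCode, name: voiceName },
    audioConfig: { audioEncoding, sampleRateHertz: sampleRate },
  };
}

const body = buildSynthesizeRequest({ text: "Hello", encoding: "Mp3", sampleRate: 24000 });
console.log(JSON.stringify(body, null, 2));
```

The REST API returns the synthesized audio as a base64-encoded audioContent field, which is the data that would be decoded and written to the output path in step 6.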

Requirements

  • Valid Google Cloud credentials with Text-to-Speech API enabled
  • Non-empty text content (max 5,000 characters)
  • Writable file path (or system temp directory access)
  • Internet connectivity to Google Cloud services

Error Handling

The node returns specific errors for common issues:

  • ErrInvalidArg:

    • Empty or missing text input
    • Invalid audio encoding format
    • Invalid sample rate
    • Invalid language code or voice name
    • Text exceeds 5,000 character limit
    • Synthesis operation failed
  • ErrCredentials:

    • Invalid or missing credentials
    • Malformed JSON credentials
    • Missing credential content
    • Insufficient API permissions
  • ErrFilePath:

    • Cannot create temporary file
    • Invalid temporary directory
  • ErrWriteFile:

    • Cannot write audio to file
    • Path is not writable
    • Disk space issues

Common Google Cloud API errors:

  • Invalid voice name: Verify voice exists for specified language
  • Invalid language code: Use correct BCP-47 language codes
  • API not enabled: Enable Text-to-Speech API in Cloud Console
  • Permission denied: Ensure service account has TTS Client role
  • Quota exceeded: Check usage limits in Cloud Console

Example Use Cases

Generate Video Narration

Input:
Text: "Welcome to our training course on automation with Robomotion."
Path: /path/to/videos/narration.mp3

Options:
Audio Encoding: Mp3
Language Code: en-US
Voice Name: en-US-Studio-M
Sample Rate: 24000

Use Case: Create professional voiceovers for training videos and presentations.

Automated Phone System

Input:
Text: "Thank you for calling. Your call is important to us. Please hold."
Path: /path/to/ivr/hold-message.wav

Options:
Audio Encoding: Wav
Language Code: en-US
Voice Name: en-US-Neural2-F
Sample Rate: 8000

Use Case: Generate IVR messages and automated phone system prompts.

Multilingual Notifications

Input:
Text: "Su pedido ha sido enviado y llegará en 2 días."
Path: /notifications/shipping-es.mp3

Options:
Audio Encoding: Mp3
Language Code: es-ES
Voice Name: es-ES-Studio-A
Sample Rate: 16000

Use Case: Create audio notifications in multiple languages for global customers.

SSML Enhanced Speech

Input:
Text: |
  <speak>
    Hello <break time="500ms"/> and welcome.
    <emphasis level="strong">This is important!</emphasis>
    <prosody rate="slow" pitch="+2st">
      Please listen carefully.
    </prosody>
  </speak>
Path: /audio/announcement.wav

Options:
Audio Encoding: Wav
SSML Text: true
Language Code: en-US
Voice Name: en-US-Studio-F
Sample Rate: 24000

Use Case: Create dynamic announcements with precise control over timing and emphasis.

Accessible Content

Input:
Text: "This is the text content from the article to be read aloud."
Path: /accessible/article-audio.mp3

Options:
Audio Encoding: Mp3
Language Code: en-US
Voice Name: en-US-Studio-M
Sample Rate: 22050

Use Case: Convert written content to audio for visually impaired users.

Email Notifications

Input:
Text: "You have a new message from customer support. Please check your inbox."

Options:
Audio Encoding: Mp3
Language Code: en-GB
Voice Name: en-GB-Neural2-A
Sample Rate: 16000

Use Case: Generate audio notifications and send as email attachments.
Note: Path is empty, so audio is saved to a temporary file.

Usage Notes

Text Input Requirements

  • Maximum length: 5,000 characters
  • For longer texts, split into multiple chunks
  • Plain text and SSML are supported
  • SSML must be well-formed XML
  • Special characters (&, <, >, and quotes) must be escaped as XML entities in SSML
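
One way to split longer plain text into chunks is sketched below. chunkText is a hypothetical helper, not part of the node. It assumes plain text (it would break SSML tags) and prefers sentence boundaries, falling back to a hard cut only when a single sentence exceeds the limit. Length is measured with JavaScript's string length, which may differ slightly from how the node counts characters.

```javascript
// Hypothetical helper: splits plain text into chunks of at most maxLen
// characters, keeping whole sentences together where possible.
function chunkText(text, maxLen = 5000) {
  // Split after sentence-ending punctuation that is followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks = [];
  let current = "";

  for (const sentence of sentences) {
    let rest = sentence;
    // Hard-split any single sentence that is longer than the limit.
    while (rest.length > maxLen) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxLen));
      rest = rest.slice(maxLen);
    }
    if (rest === "") continue;

    const candidate = current ? current + " " + rest : rest;
    if (candidate.length <= maxLen) {
      current = candidate; // still fits in the current chunk
    } else {
      chunks.push(current); // close the current chunk and start a new one
      current = rest;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(chunkText("One. Two. Three.", 10));
// → ["One. Two.", "Three."]
```

Each chunk can then be passed to a separate Text To Speech node execution, and the resulting audio files joined afterwards if a single file is needed.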

Voice Selection Guidelines

  • Studio voices: Best quality, most natural sounding (recommended)

  • Neural2 voices: High quality, good naturalness

  • Wavenet voices: Advanced quality, natural prosody

  • Standard voices: Basic quality, lower cost

  • Voice availability varies by language

  • Not all voices support all features

  • Test different voices to find the best fit

  • Use language-appropriate voices (don't use en-US voice for Spanish text)

Audio Format Selection

  • WAV: Choose for maximum quality and compatibility

    • No compression artifacts
    • Best for further audio processing
    • Larger file sizes
  • MP3: Choose for balanced quality and size

    • Good compression efficiency
    • Wide compatibility
    • Smaller files than WAV
  • OGG Opus: Choose for streaming and web

    • Best compression efficiency
    • Excellent quality at low bitrates
    • May have compatibility limitations

Sample Rate Guidelines

  • Higher sample rates give better quality but produce larger files
  • Match sample rate to your use case:
    • 8000 Hz: Phone systems, IVR
    • 16000 Hz: General use, voice assistants (recommended default)
    • 22050-24000 Hz: High quality content, videos
    • 44100-48000 Hz: Professional production, music
  • Some voices may have sample rate limitations
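
To make the size trade-off concrete, the following sketch estimates the size of a WAV (LINEAR16) file for a given duration. It assumes mono 16-bit audio and a canonical 44-byte WAV header. Both are assumptions about the output, and estimateWavBytes is a hypothetical helper.

```javascript
// Hypothetical helper: estimates the size in bytes of a mono LINEAR16 WAV file.
function estimateWavBytes(seconds, sampleRate = 16000) {
  const BYTES_PER_SAMPLE = 2; // LINEAR16 means 16-bit (2-byte) samples
  const HEADER_BYTES = 44; // assumption: canonical WAV header size
  return HEADER_BYTES + Math.round(seconds * sampleRate * BYTES_PER_SAMPLE);
}

// One minute of speech at several of the common sample rates.
for (const rate of [8000, 16000, 24000, 48000]) {
  console.log(`${rate} Hz: ${estimateWavBytes(60, rate)} bytes`);
}
// 16000 Hz → 1920044 bytes (about 1.9 MB per minute)
```

Under these assumptions, file size grows linearly with sample rate, so 48000 Hz audio is three times the size of 16000 Hz audio of the same duration.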

SSML Features

Enable SSML to control:

  • Pauses: <break time="500ms"/>
  • Emphasis: <emphasis level="strong">important</emphasis>
  • Prosody: <prosody rate="slow" pitch="+2st">text</prosody>
  • Say As: <say-as interpret-as="cardinal">123</say-as>
  • Phonemes: <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
  • Audio: <audio src="https://example.com/sound.mp3">fallback</audio>

File Path Handling

  • If Path is empty, file is saved to temporary directory
  • Temporary files have unique names to avoid conflicts
  • File extension matches the audio encoding (.wav, .mp3, .ogg)
  • Ensure target directory exists and is writable
  • Use absolute paths for reliability
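
The path rules above can be sketched as follows. This is an illustration under assumptions rather than the node's actual logic: resolveOutputPath and the EXTENSIONS map are hypothetical, the temp file naming scheme is invented, and rejecting relative paths is a stricter reading of "use absolute paths" than the node may apply.

```javascript
const os = require("os");
const path = require("path");
const crypto = require("crypto");

// File extension for each Audio Encoding option, per the list above.
const EXTENSIONS = new Map([
  ["Wav", ".wav"],
  ["Mp3", ".mp3"],
  ["Ogg Opus", ".ogg"],
]);

// Hypothetical helper: returns the user's path if one is given, otherwise a
// uniquely named file in the system temp directory with a matching extension.
function resolveOutputPath(userPath, encoding) {
  const ext = EXTENSIONS.get(encoding);
  if (!ext) throw new Error("invalid encoding");

  if (userPath && userPath.trim() !== "") {
    // Assumption: relative paths are rejected outright.
    if (!path.isAbsolute(userPath)) throw new Error("Invalid input path");
    return userPath;
  }
  // A random UUID keeps concurrent runs from colliding.
  return path.join(os.tmpdir(), `tts-${crypto.randomUUID()}${ext}`);
}

console.log(resolveOutputPath("", "Mp3")); // a unique .mp3 path in the temp directory
```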

Cost Optimization

  • Text-to-Speech is billed per 1 million characters
  • Standard voices cost less than Neural2, WaveNet, and Studio voices
  • Cache generated audio files to reduce API calls
  • Reuse audio for repeated content
  • Monitor usage in Google Cloud Console
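
A simple way to cache generated audio is to derive a file name from everything that affects the output, so identical requests map to the same file. The sketch below assumes a hypothetical cacheKey helper and is not a feature of the node itself.

```javascript
const crypto = require("crypto");

// Hypothetical helper: returns a stable SHA-256 key for one synthesis request.
// Any option that changes the audio must be part of the key.
function cacheKey({ text, languageCode, voiceName, encoding, sampleRate, isSsml = false }) {
  const payload = JSON.stringify([text, languageCode, voiceName, encoding, sampleRate, isSsml]);
  return crypto.createHash("sha256").update(payload, "utf8").digest("hex");
}

const key = cacheKey({
  text: "Thank you for calling.",
  languageCode: "en-US",
  voiceName: "en-US-Neural2-F",
  encoding: "Wav",
  sampleRate: 8000,
});
console.log(`${key}.wav`); // candidate file name for the cached audio
```

Before running the Text To Speech node, a flow could check whether a file with this name already exists and skip the API call if it does.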

Language and Voice Matching

  • Always match voice to language code
  • Using mismatched voice/language produces errors
  • Each language has different available voices
  • Check Google Cloud voices for available options
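
Because the voice names listed in this document start with their language code, a cheap pre-check can catch obvious mismatches before calling the API. voiceMatchesLanguage is a hypothetical helper, and the prefix rule is a heuristic based on the naming pattern shown above. It does not confirm that the voice actually exists.

```javascript
// Hypothetical helper: true if the voice name begins with the language code,
// e.g. "en-US-Studio-M" for "en-US". Comparison is case-insensitive.
function voiceMatchesLanguage(voiceName, languageCode) {
  return voiceName.toLowerCase().startsWith(languageCode.toLowerCase() + "-");
}

console.log(voiceMatchesLanguage("en-US-Studio-M", "en-US")); // → true
console.log(voiceMatchesLanguage("en-US-Studio-M", "es-ES")); // → false
```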

Output Usage

The Path output contains the file system path to the generated audio:

// Access the audio file path
const audioPath = $.path;

// Use in subsequent nodes
// - Upload to cloud storage
// - Send via email
// - Play in application
// - Convert to different format
// - Analyze audio properties

Example flow:

  1. Text to Speech node generates audio → outputs path
  2. Upload File node uploads audio to cloud storage
  3. Send Email node attaches the audio file
  4. Delete File node cleans up temporary file

Tips for Best Results

  1. Voice Selection: Test multiple voices to find the most natural option

    • Studio voices provide the best quality
    • Match voice gender to content context
    • Consider regional accents for target audience
  2. Text Preparation: Format text for optimal speech

    • Use proper punctuation for natural pauses
    • Spell out numbers, acronyms, and abbreviations
    • Break long sentences into shorter segments
    • Remove formatting characters (markdown, HTML, etc.)
  3. SSML Usage: Leverage SSML for precise control

    • Add pauses for emphasis and clarity
    • Control pronunciation of difficult words
    • Adjust speaking rate for complex content
    • Use emphasis for key points
  4. Audio Format: Choose based on use case

    • WAV for production and editing
    • MP3 for web and distribution
    • OGG Opus for streaming applications
  5. Sample Rate: Balance quality and file size

    • 16kHz is sufficient for most voice content
    • Use higher rates only when quality is critical
    • Match rate to your audio infrastructure
  6. Error Handling: Implement robust error recovery

    • Validate text length before processing
    • Check credentials before batch operations
    • Handle temporary file cleanup
    • Retry on transient network errors
  7. Batch Processing: For multiple texts

    • Reuse credentials across nodes
    • Process in parallel when possible
    • Implement rate limiting if needed
    • Cache results when appropriate
  8. Quality Control: Verify output quality

    • Listen to generated samples
    • Check pronunciation of technical terms
    • Adjust SSML for natural flow
    • Test with target audience
  9. File Management: Handle generated files properly

    • Use meaningful file names
    • Organize by language/voice if needed
    • Clean up temporary files
    • Implement backup strategies
  10. Localization: For multilingual content

    • Use appropriate language codes
    • Select culturally appropriate voices
    • Test with native speakers
    • Consider regional preferences
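
Tip 6 recommends retrying on transient network errors. A minimal sketch of that idea with exponential backoff follows. withRetry is a hypothetical helper, and which errors count as transient is left to the caller through the assumed isTransient callback.

```javascript
// Hypothetical helper: runs fn, retrying failed attempts with exponential
// backoff (baseDelayMs, then 2x, 4x, ...) up to the given number of attempts.
async function withRetry(fn, { attempts = 3, baseDelayMs = 500, isTransient = () => true } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn(i);
    } catch (err) {
      lastError = err;
      // Give up on non-transient errors or when no attempts remain.
      if (!isTransient(err) || i === attempts - 1) break;
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// Usage: the wrapped function here is a stand-in for a synthesis call.
withRetry(async (attempt) => `succeeded on attempt ${attempt + 1}`)
  .then((result) => console.log(result));
```

Errors such as invalid credentials or an invalid voice name will not succeed on retry, so isTransient should return false for them.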

Common Errors and Solutions

"Input text cannot be empty"

  • Cause: No text provided in the Text input
  • Solution: Provide valid text content (1-5,000 characters)

"No Credential Content"

  • Cause: Credentials not properly configured
  • Solution: Verify service account JSON is complete and valid

"invalid encoding"

  • Cause: Audio encoding option is invalid
  • Solution: Use one of: wav, mp3, ogg

"Failed to synthesize speech"

  • Cause: Various API or configuration issues
  • Solution:
    • Verify language code is valid (BCP-47 format)
    • Check voice name exists for specified language
    • Ensure text is within character limit
    • Verify SSML is well-formed (if used)
    • Check API is enabled in Cloud Console

"Invalid input path"

  • Cause: Path parameter is invalid
  • Solution: Provide valid writable path or leave empty for temp file

"Failed to write audio to file"

  • Cause: Cannot write to specified path
  • Solution:
    • Verify directory exists
    • Check write permissions
    • Ensure sufficient disk space
    • Try using temp file (empty Path)

Poor Audio Quality

  • Cause: Suboptimal settings
  • Solution:
    • Use Studio or Neural2 voices
    • Increase sample rate (24000 Hz+)
    • Use WAV format for uncompressed audio
    • Check text formatting

Unnatural Pronunciation

  • Cause: Text contains difficult words or formatting
  • Solution:
    • Enable SSML and use phoneme tags
    • Spell out abbreviations and acronyms
    • Add appropriate punctuation
    • Use say-as tags for numbers and dates

Voice Not Available Error

  • Cause: Selected voice doesn't exist for language
  • Solution: Check available voices for your language

SSML Parsing Error

  • Cause: Malformed SSML content
  • Solution:
    • Validate SSML XML structure
    • Properly escape special characters
    • Use valid SSML tags only
    • Test with simple SSML first
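
When plain text is inserted into an SSML template, escaping the XML special characters first avoids most parsing errors. The helper name escapeSsml below is hypothetical. The entity replacements themselves are standard XML.

```javascript
// Hypothetical helper: escapes the five XML special characters so that
// arbitrary text can be placed safely inside SSML elements.
function escapeSsml(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

const ssml = `<speak>${escapeSsml("Terms & conditions for A < B")}</speak>`;
console.log(ssml);
// → <speak>Terms &amp; conditions for A &lt; B</speak>
```

Only the text content should be escaped. Applying the function to a string that already contains SSML tags would escape the tags as well.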

SSML Examples

Basic Pauses and Emphasis

<speak>
Welcome to our service.
<break time="1s"/>
<emphasis level="strong">This is important.</emphasis>
Please <break time="500ms"/> continue.
</speak>

Pronunciation Control

<speak>
The year is <say-as interpret-as="date" format="y">2024</say-as>.
Call us at <say-as interpret-as="telephone">555-123-4567</say-as>.
Price: <say-as interpret-as="currency">$99.99</say-as>
</speak>

Speaking Rate and Pitch

<speak>
Normal speed text here.
<prosody rate="slow">This part is spoken slowly.</prosody>
<prosody rate="fast" pitch="+2st">This is fast and higher pitch.</prosody>
</speak>

Regional Pronunciations (using phonemes)

<speak>
How to say tomato:
<phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme> or
<phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme>
</speak>