Speech to Text

Transcribes audio files to text using ElevenLabs AI speech recognition, powered by the OpenAI Whisper model.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - The automation will continue regardless of any error. The default value is false.

Inputs

  • Connection Id (String) - Connection ID from the Connect node. Optional if you provide API Key directly.
  • Audio File Path (String) - Path to the audio file to transcribe (supports common formats like MP3, WAV, M4A, etc.).

Options

  • Model - Speech recognition model to use:
    • Whisper-1 - OpenAI Whisper v1 model for accurate speech recognition
  • Language (String) - Optional language code to guide transcription (e.g., "en" for English, "es" for Spanish, "fr" for French). Leave empty for automatic language detection.
  • Prompt (String) - Optional text prompt to guide the transcription style or vocabulary. Useful for technical terms or specific contexts.
  • Temperature (String) - Sampling temperature from 0.0 to 1.0 (default: 0). Higher values increase randomness, lower values make output more deterministic.
  • API Key - Your ElevenLabs AI API key. Optional if using Connection ID.

Outputs

  • Transcription (Object) - The transcribed text and metadata, including:
    • text - The transcribed text content
    • language - Detected or specified language (if available)
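For example, a successful run might produce an output shaped like this (illustrative values, not actual API output):

```python
# Hypothetical shape of the Transcription output object
transcription = {
    "text": "Welcome everyone, and thank you for joining the quarterly review.",
    "language": "en",  # detected or specified language; may be absent
}

print(transcription["text"])
```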

How It Works

The Speech to Text node transcribes audio to text using AI. When executed, the node:

  1. Validates that the audio file path is provided
  2. Validates temperature if specified (must be between 0.0 and 1.0)
  3. Either uses the provided connection or creates a new client with direct API key
  4. Opens the audio file and prepares transcription parameters
  5. Calls the ElevenLabs speech-to-text API with optional language, prompt, and temperature
  6. Returns the transcribed text and metadata
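As a rough sketch, steps 1, 2, and 4 above might look like the following (illustrative only: `build_transcription_params` and the parameter names are assumptions, not the node's actual implementation):

```python
def build_transcription_params(audio_file_path, language=None, prompt=None, temperature=None):
    """Validate inputs and assemble request parameters (steps 1, 2, and 4)."""
    # Step 1: the audio file path must be provided
    if not audio_file_path:
        raise ValueError("Audio File Path cannot be empty. "
                         "Please provide the path to the audio file.")
    # Step 2: temperature, when given, must parse as a number in 0.0-1.0
    if temperature is not None:
        try:
            temperature = float(temperature)
        except ValueError:
            raise ValueError("Temperature must be a valid number between 0.0 and 1.0.")
        if not 0.0 <= temperature <= 1.0:
            raise ValueError("Temperature must be between 0.0 and 1.0.")
    # Step 4 (in part): only non-empty optional fields are forwarded
    params = {"model": "whisper-1", "file": audio_file_path}
    if language:
        params["language"] = language
    if prompt:
        params["prompt"] = prompt
    if temperature is not None:
        params["temperature"] = temperature
    return params
```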

Requirements

  • Valid ElevenLabs API key (via Connect node or direct option)
  • Audio file in a supported format (MP3, WAV, M4A, FLAC, etc.)
  • Temperature value must be between 0.0 and 1.0 if specified
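A quick pre-flight check against these requirements could look like this (the extension list mirrors the formats named above and is not exhaustive):

```python
import os

# Formats named in this document; the node may accept others
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}

def looks_supported(path):
    """Cheap extension-based check before sending a file for transcription."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS
```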

Error Handling

The node will return specific errors in the following cases:

  • Missing audio path - "Audio File Path cannot be empty. Please provide the path to the audio file."
  • Invalid temperature range - "Temperature must be between 0.0 and 1.0."
  • Invalid temperature format - "Temperature must be a valid number between 0.0 and 1.0."
  • File not found - "Audio file not found at: [path]. Please verify the file path is correct."
  • Transcription failure - "Failed to transcribe audio: [error details]"
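Putting the file checks and failure handling together, a wrapper might look like this (a sketch, not the node's real code; `call_api` stands in for the actual ElevenLabs request):

```python
import os

def transcribe(audio_file_path, call_api):
    """Apply the documented checks, then delegate to the API call."""
    if not audio_file_path:
        raise ValueError("Audio File Path cannot be empty. "
                         "Please provide the path to the audio file.")
    if not os.path.exists(audio_file_path):
        raise FileNotFoundError(
            f"Audio file not found at: {audio_file_path}. "
            "Please verify the file path is correct.")
    try:
        return call_api(audio_file_path)
    except Exception as exc:
        # Wrap any API failure in the documented error message
        raise RuntimeError(f"Failed to transcribe audio: {exc}") from exc
```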

Usage Notes

  • Language - Specifying the language can improve accuracy, but auto-detection works well in most cases
  • Prompt - Use prompts to guide transcription for:
    • Technical terminology
    • Industry-specific vocabulary
    • Proper nouns and names
    • Specific formatting preferences
  • Temperature - Usually best left at 0 for the most accurate, deterministic results. Increase it only if you need creative variation
  • The Whisper model supports many languages and can auto-detect them
  • Audio quality affects transcription accuracy - clear audio with minimal background noise works best
  • Longer audio files may take more time to process
  • The model handles various accents and speaking styles
  • Supports a variety of audio formats - common formats such as MP3 and WAV work well
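For instance, a vocabulary-steering prompt (a made-up example) might be:

```python
# Listing domain terms and proper nouns in the prompt nudges the model
# toward spelling them correctly in the transcript.
prompt = (
    "The call covers Kubernetes, Istio, kubectl, and OAuth2, "
    "with speakers Dr. Nguyen and Ms. Okafor."
)
```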

Example Use Cases

  • Transcribing meeting recordings or interviews
  • Converting voice memos to text
  • Creating subtitles from video audio tracks
  • Extracting text from podcast episodes
  • Transcribing phone calls or voicemails
  • Converting audio notes to searchable text
  • Accessibility features for audio content
  • Analyzing customer service calls