Transcribe Audio

Transcribe audio files to text using OpenAI Whisper and GPT-4o transcription models.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Seconds to wait before executing the node.
  • Delay After (sec) - Seconds to wait after executing the node.
  • Continue On Error - The automation continues even if the node fails. Default: false.

Inputs

  • Connection Id - Connection identifier from Connect node.
  • Audio File - Path to audio file. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, flac, ogg.
  • Use Robomotion AI Credits - Use Robomotion credits instead of your own API key.

Options

Model Selection

  • Model - Speech recognition model:
    • GPT-4o Transcribe - Latest high-accuracy model (default)
    • GPT-4o Mini Transcribe - Fast and efficient
    • GPT-4o Transcribe (Diarize) - Identifies different speakers
    • Whisper-1 - Original Whisper model

Transcription Settings

  • Language - Audio language in ISO-639-1 format (e.g., "en", "es", "fr"). Optional - model auto-detects if not specified.
  • Prompt - Optional text to guide transcription style, vocabulary, or context.
  • Temperature - Sampling temperature (0-1). Higher values increase randomness. Default: 0.
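These settings map onto parameters of OpenAI's transcription endpoint. A minimal sketch of that mapping (the parameter names match the public OpenAI API; how the node builds the request internally is an assumption):

```python
# Sketch: assemble the request parameters assumed to be sent to OpenAI's
# /v1/audio/transcriptions endpoint. Optional settings are omitted when
# unset so the model can auto-detect the language.
def build_transcription_params(model, language=None, prompt=None, temperature=0.0):
    params = {"model": model, "temperature": temperature}
    if language:
        params["language"] = language  # ISO-639-1 code, e.g. "es"
    if prompt:
        params["prompt"] = prompt      # vocabulary/context hint
    return params
```

For example, `build_transcription_params("gpt-4o-transcribe", language="es")` yields a request with an explicit language, while leaving Language blank omits the key entirely.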

Advanced

  • Timeout (seconds) - Request timeout. Default: 120.
  • Include Raw Response - Adds the full API response, with timestamps and segments, to the node output. Default: false.

Outputs

  • Text - Transcribed text from the audio file.
  • Raw Response - Full response with timestamps and segments (when enabled).

How It Works

The node converts speech in an audio file to text in five steps:

  1. Validates connection and audio file path
  2. Checks file format is supported
  3. Uploads audio to OpenAI
  4. Processes with selected model
  5. Returns transcribed text
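Steps 1 and 2 above amount to a pre-flight check before anything is uploaded. A sketch of that validation, using the supported-format list and error messages from this page:

```python
import os

# Formats accepted by the node (per the Inputs section).
SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm", "flac", "ogg"}

def validate_audio_file(path):
    """Mirror steps 1-2: path present, file exists, extension supported."""
    if not path:
        raise ValueError("Audio File cannot be empty")
    if not os.path.isfile(path):
        raise FileNotFoundError("Audio file does not exist")
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError("Unsupported file format: " + ext)
    return ext
```

Each failure path corresponds to one of the messages listed under Common Errors below.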

Usage Examples

Example 1: Basic Transcription

Input:
- Audio File: "C:/recordings/meeting.mp3"
- Model: gpt-4o-transcribe

Output:
- Text: "Good morning everyone. Let's begin today's meeting..."

Example 2: Multilingual Transcription

Input:
- Audio File: "C:/audio/spanish_call.wav"
- Language: "es"
- Model: gpt-4o-transcribe

Output:
- Text: "Buenos días, gracias por llamar..."

Example 3: Speaker Diarization

Input:
- Audio File: "C:/interviews/conversation.mp3"
- Model: gpt-4o-transcribe-diarize

Output (in Raw Response):
- Segments with speaker identification
- Speaker 1: "Hello, how are you?"
- Speaker 2: "I'm doing well, thank you."
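When using the diarize model, the speaker-labeled lines above come from the segments in Raw Response. A sketch of turning those segments into a readable transcript (the segment field names "speaker" and "text" are an assumed shape; inspect the actual Raw Response in your flow):

```python
# Sketch: join diarized segments into a speaker-labeled transcript.
# The "speaker" and "text" keys are assumptions about the segment
# schema, not a documented contract.
def format_diarized(segments):
    lines = []
    for seg in segments:
        lines.append("%s: %s" % (seg["speaker"], seg["text"]))
    return "\n".join(lines)
```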

Example 4: With Context Prompt

Input:
- Audio File: "C:/medical/consultation.wav"
- Prompt: "Medical consultation discussing diabetes, insulin, and glucose levels"
- Model: gpt-4o-transcribe

Output:
- Text: More accurate medical terminology transcription

Requirements

  • Connection Id from Connect node
  • Valid audio file in supported format
  • File must exist and be accessible

Tips for RPA Developers

  • Model Selection: Use GPT-4o Transcribe for best accuracy, GPT-4o Mini for speed, Diarize for multi-speaker scenarios.
  • Language Specification: Specify language for better accuracy, especially for non-English audio.
  • Prompt Usage: Provide context or technical vocabulary to improve accuracy.
  • File Formats: MP3 and M4A work well for most use cases. WAV is uncompressed and larger.
  • Temperature: Keep at 0 for deterministic, most accurate transcription. Higher values add randomness and are rarely useful for transcription.
  • File Size: OpenAI has file size limits (typically 25MB). Compress large files if needed.
  • Timestamps: Enable "Include Raw Response" to get word-level timestamps.
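The file-size tip is easy to automate: check the size before running the node and compress (e.g. re-encode to MP3) only when needed. A sketch, assuming the commonly documented 25 MB upload limit:

```python
import os

MAX_BYTES = 25 * 1024 * 1024  # assumed ~25 MB OpenAI upload limit

def fits_upload_limit(path):
    """Return True if the audio file is small enough to upload directly."""
    return os.path.getsize(path) <= MAX_BYTES
```

If this returns False, re-encode the audio to a compressed format such as MP3 before passing it to the node.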

Common Errors

"Audio File cannot be empty"

  • Provide a valid path to an audio file

"Audio file does not exist"

  • Check that the file path is correct and the file exists

"Unsupported file format"

  • Convert audio to supported format (mp3, wav, etc.)