
Transcribe Audio with Whisper-Large-v3-Turbo

Audio · Intermediate · 15 min

Tenzro's audio runtime serves automatic speech recognition (ASR) out of the box. Wave 1 supports OpenAI Whisper, Distil-Whisper, Moonshine v2, and NVIDIA Parakeet/Canary. This tutorial uses whisper-large-v3-turbo, the strongest permissively licensed ASR model in the catalog, to transcribe a WAV file via the CLI and JSON-RPC.

1. Download the model

# Whisper-Large-v3-Turbo is permissively licensed (MIT)
tenzro model download whisper-large-v3-turbo

# Output:
# Resolving artifact bundle from HuggingFace Hub...
#   Source: openai/whisper-large-v3-turbo (ONNX export)
#   License tier: Permissive
#   Files: encoder.onnx, decoder.onnx, tokenizer.json, config.json
#   SHA-256 verified for all files
#   Saved to: ~/.tenzro/models/whisper-large-v3-turbo/
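The CLI performs the SHA-256 verification itself; conceptually the check amounts to something like the following Python sketch (the `verify` helper and its manifest are hypothetical, and the model directory is taken from the output above):

```python
import hashlib
from pathlib import Path

# Directory reported by `tenzro model download` above
MODEL_DIR = Path.home() / ".tenzro" / "models" / "whisper-large-v3-turbo"

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large ONNX weights never
    have to fit in memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest: dict[str, str], base: Path = MODEL_DIR) -> bool:
    """Hypothetical check: compare every file's digest against an
    expected-hash manifest, e.g. {"encoder.onnx": "ab12..."}."""
    return all(sha256_of(base / name) == digest
               for name, digest in manifest.items())
```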

2. Load into the audio runtime

The audio runtime decodes WAV via hound, MP3/FLAC via symphonia, computes a 128-bin mel-spectrogram with realfft, and feeds the encoder/decoder bundle.

# Load into the audio (ASR) runtime
tenzro audio load whisper-large-v3-turbo

# Output:
# Audio runtime loaded:
#   Model: whisper-large-v3-turbo
#   Modality: audio (ASR)
#   Sample rate: 16000 Hz
#   Mel bins: 128
#   Bundle: encoder + decoder
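To make the "16000 Hz / 128 mel bins" parameters concrete, here is an illustrative NumPy sketch of a log-mel frontend like the one the runtime computes. The 25 ms window and 10 ms hop are Whisper's standard choices, assumed here rather than stated by the runtime output:

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz, per the runtime banner
N_FFT = 400            # 25 ms window at 16 kHz (Whisper convention, assumed)
HOP = 160              # 10 ms hop (assumed)
N_MELS = 128           # mel bins reported by `tenzro audio load`

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SAMPLE_RATE):
    """Triangular filters mapping rfft bins to mel bins."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - left) / (center - left)
        down = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

def log_mel(samples):
    """Mono float PCM at 16 kHz -> (128, n_frames) log-mel spectrogram."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(samples) - N_FFT) // HOP
    frames = np.stack([samples[i * HOP : i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = mel_filterbank() @ power.T
    return np.log10(np.maximum(mel, 1e-10))

# Smoke test: one second of a 440 Hz tone
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
spec = log_mel(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)  # (128, 98)
```

The actual runtime does this in Rust (realfft) after decoding with hound or symphonia; the sketch only shows the shape of the computation.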

3. Transcribe via the CLI

# Transcribe a WAV file (also supports MP3 and FLAC via symphonia)
tenzro transcribe \
  --model whisper-large-v3-turbo \
  --audio ./meeting.wav \
  --language en

# Output:
# Transcription:
#   "The settlement layer reconciles every inference call against on-chain
#    receipts. Each receipt binds the output to the model weights hash..."
#
# Duration: 42.3s
# Latency: 3.1s (real-time factor: 0.07x)
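The real-time factor in the output is just latency divided by audio duration, so values below 1.0 mean faster-than-real-time transcription. A one-line helper (hypothetical, matching the numbers above):

```python
def real_time_factor(latency_s: float, audio_duration_s: float) -> float:
    """RTF < 1.0 means the model transcribes faster than real time."""
    return latency_s / audio_duration_s

# Numbers from the CLI output above: 3.1 s latency for 42.3 s of audio
rtf = real_time_factor(3.1, 42.3)
print(f"{rtf:.2f}x")  # 0.07x
```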

4. Transcribe via JSON-RPC

# Equivalent JSON-RPC call. Audio is sent as base64-encoded raw bytes.
# GNU base64 wraps output at 76 columns by default, which would break the
# JSON string; disable wrapping with -w 0 (on macOS, use `base64 -i`).
AUDIO_B64=$(base64 -w 0 ./meeting.wav)

curl https://rpc.tenzro.network \
  -X POST \
  -H "Content-Type: application/json" \
  -d "{
    \"jsonrpc\": \"2.0\",
    \"id\": 1,
    \"method\": \"tenzro_transcribe\",
    \"params\": {
      \"model_id\": \"whisper-large-v3-turbo\",
      \"audio_base64\": \"$AUDIO_B64\",
      \"language\": \"en\"
    }
  }" | jq

A typical response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "model_id": "whisper-large-v3-turbo",
    "text": "The settlement layer reconciles every inference call against on-chain receipts...",
    "language": "en",
    "duration_secs": 42.3,
    "latency_ms": 3100
  }
}
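The same request can be assembled in Python with only the standard library, which sidesteps the shell-quoting pitfalls of interpolating `$AUDIO_B64` into a `curl -d` string. The endpoint, method name, and params mirror the curl call above; the `build_transcribe_request` helper is illustrative, not part of any Tenzro SDK:

```python
import base64
import json

RPC_URL = "https://rpc.tenzro.network"  # endpoint from the curl example

def build_transcribe_request(audio_path: str, model_id: str,
                             language: str = "en", request_id: int = 1) -> str:
    """Assemble the tenzro_transcribe JSON-RPC 2.0 payload as a JSON string."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tenzro_transcribe",
        "params": {
            "model_id": model_id,
            "audio_base64": audio_b64,
            "language": language,
        },
    })

# POST the payload with any HTTP client, e.g.:
#   import urllib.request
#   payload = build_transcribe_request("./meeting.wav", "whisper-large-v3-turbo")
#   req = urllib.request.Request(RPC_URL, data=payload.encode(),
#                                headers={"Content-Type": "application/json"})
#   body = json.load(urllib.request.urlopen(req))["result"]["text"]
```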

5. Other ASR models in the catalog

# Other audio models in the Tenzro catalog (all ASR-only in wave 1):
#   moonshine-v2-tiny             permissive, fastest CPU option
#   moonshine-v2-base             permissive
#   distil-whisper-small.en       permissive, English-only
#   distil-whisper-medium.en      permissive, English-only
#   distil-whisper-large-v3       permissive
#   whisper-large-v3-turbo        permissive (used here)
#   parakeet-tdt-0.6b-v3          permissive, NVIDIA encoder/decoder/joiner triple
#   canary-1b-flash               attribution (CC-BY-4.0 logged)

See also