Voxtral Transcribe

Web UI for local speech-to-text using voxtral.c on Apple Silicon.


Quick Start

Prerequisites: macOS with Apple Silicon, ffmpeg, Python 3.10+, uv

git clone https://github.com/r-dh/voxtral-app
cd voxtral-app

# Build voxtral.c
cd voxtral.c
make mps
./download_model.sh    # ~8.9 GB
cd ..

# Run
uv sync
uv run uvicorn server:app --host 0.0.0.0 --port 8000

Open http://localhost:8000.

Features

  • Drag & drop audio files (WAV, MP3, OGG, M4A, FLAC, AAC) or record from the microphone
  • Tokens stream to the browser as the model generates them
  • Model stays loaded in GPU memory between transcriptions (~8.2 GB)
  • Cancel mid-transcription without unloading the model (SIGUSR1)
  • Adjustable speed/accuracy tradeoff (WER 6.7% to 12.6%)
  • Single HTML file, no build tools

How It Works

Browser  <-- SSE -->  FastAPI  <-- stdin/stdout -->  voxtral --server
                                                       |
                                                   voxtral-model/
                                                   (8.9 GB, Metal GPU)

The server manages a persistent voxtral --server process. The model loads once into GPU memory and stays resident across transcriptions. Each request writes a WAV path to the process's stdin, then reads tokens from stdout and progress updates from stderr. Server-Sent Events relay everything to the browser.
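
For orientation, here is a minimal Python sketch of that relay loop. It assumes a simple line-oriented protocol (one WAV path in per request, one token per stdout line, a made-up <END> marker) and a hypothetical /transcribe endpoint; the actual wire format and routes live in server.py and may differ. Progress on stderr is omitted for brevity.

# Sketch only, not the actual server.py. Assumes voxtral --server reads
# one WAV path per line on stdin and writes one token per line to stdout.
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
proc: asyncio.subprocess.Process | None = None

async def get_engine() -> asyncio.subprocess.Process:
    # Start voxtral --server once; the model stays in GPU memory afterward.
    global proc
    if proc is None or proc.returncode is not None:
        proc = await asyncio.create_subprocess_exec(
            "./voxtral.c/voxtral", "--server",
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
        )
    return proc

@app.get("/transcribe")
async def transcribe(path: str):
    engine = await get_engine()
    engine.stdin.write(f"{path}\n".encode())   # one request = one WAV path
    await engine.stdin.drain()

    async def events():
        # Relay each stdout token to the browser as a Server-Sent Event.
        while line := await engine.stdout.readline():
            token = line.decode().rstrip("\n")
            if token == "<END>":               # hypothetical end-of-stream marker
                break
            yield f"data: {token}\n\n"

    return StreamingResponse(events(), media_type="text/event-stream")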

Cancelling sends SIGUSR1 to abort the current transcription without killing the process. The model stays loaded and the next transcription starts immediately.
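
A cancel endpoint then only needs to forward the signal to the resident child process. Continuing the hypothetical names from the sketch above:

# Sketch: abort the in-flight transcription without unloading the model.
# voxtral's own SIGUSR1 handler stops decoding but keeps the weights loaded.
import signal

@app.post("/cancel")
async def cancel():
    if proc is not None and proc.returncode is None:
        proc.send_signal(signal.SIGUSR1)
    return {"cancelled": True}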

Performance

Tested on M1 Max (32-core GPU, 64 GB unified memory):

Audio      Time     Speed           Quality setting
7s clip    ~18s     0.4x realtime   Balanced
60s clip   ~150s    0.4x realtime   Balanced
60s clip   ~250s    0.2x realtime   Most accurate

First run includes ~25s to load the model. Subsequent transcriptions start immediately.

This is a 4-billion-parameter model running locally. For faster local transcription with smaller models, see whisper.cpp.

Project Structure

server.py           FastAPI server, process management, SSE streaming
static/index.html   Single-file frontend (vanilla HTML/CSS/JS)
voxtral.c/          antirez's voxtral.c (pure C, Metal GPU)

Requirements

  • Apple Silicon Mac (M1, M2, M3, or M4)
  • ~10 GB disk for model weights
  • ffmpeg (brew install ffmpeg)
  • Python 3.10+ (uv sync or pip install fastapi uvicorn python-multipart)

Credits

voxtral.c by antirez: pure C Voxtral inference with Metal GPU support.
