Voxtral Transcribe

Web UI for local speech-to-text using voxtral.c on Apple Silicon.


Quick Start

Prerequisites: macOS with Apple Silicon, ffmpeg, Python 3.10+, uv

git clone https://github.com/r-dh/voxtral-app
cd voxtral-app

# Build voxtral.c
cd voxtral.c
make mps
./download_model.sh    # ~8.9 GB
cd ..

# Run
uv sync
uv run uvicorn server:app --host 0.0.0.0 --port 8000

Open http://localhost:8000.

Features

  • Drag & drop audio files (WAV, MP3, OGG, M4A, FLAC, AAC) or record from the microphone
  • Tokens stream to the browser as the model generates them
  • Model stays loaded in GPU memory between transcriptions (~8.2 GB)
  • Cancel mid-transcription without unloading the model (SIGUSR1)
  • Adjustable speed/accuracy tradeoff (WER 6.7% to 12.6%)
  • Single HTML file, no build tools

How It Works

Browser  <-- SSE -->  FastAPI  <-- stdin/stdout -->  voxtral --server
                                                       |
                                                   voxtral-model/
                                                   (8.9 GB, Metal GPU)

The server manages a persistent voxtral --server process. The model loads once into GPU memory and stays resident across transcriptions. Each request writes a WAV path to the process's stdin, then reads tokens from stdout and progress updates from stderr. Server-Sent Events relay everything to the browser.
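
For orientation, here is a minimal Python sketch of that relay loop. It assumes a simple line-oriented protocol (one WAV path in per request, one token per stdout line, a made-up <END> marker) and a hypothetical /transcribe endpoint; the actual wire format and routes live in server.py and may differ. Progress on stderr is omitted for brevity.

# Sketch only, not the actual server.py. Assumes voxtral --server reads
# one WAV path per line on stdin and writes one token per line to stdout.
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
proc: asyncio.subprocess.Process | None = None

async def get_engine() -> asyncio.subprocess.Process:
    # Start voxtral --server once; the model stays in GPU memory afterward.
    global proc
    if proc is None or proc.returncode is not None:
        proc = await asyncio.create_subprocess_exec(
            "./voxtral.c/voxtral", "--server",
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
        )
    return proc

@app.get("/transcribe")
async def transcribe(path: str):
    engine = await get_engine()
    engine.stdin.write(f"{path}\n".encode())   # one request = one WAV path
    await engine.stdin.drain()

    async def events():
        # Relay each stdout token to the browser as a Server-Sent Event.
        while line := await engine.stdout.readline():
            token = line.decode().rstrip("\n")
            if token == "<END>":               # hypothetical end-of-stream marker
                break
            yield f"data: {token}\n\n"

    return StreamingResponse(events(), media_type="text/event-stream")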

Cancelling sends SIGUSR1 to abort the current transcription without killing the process. The model stays loaded and the next transcription starts immediately.
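
A cancel endpoint then only needs to forward the signal to the resident child process. Continuing the hypothetical names from the sketch above:

# Sketch: abort the in-flight transcription without unloading the model.
# voxtral's own SIGUSR1 handler stops decoding but keeps the weights loaded.
import signal

@app.post("/cancel")
async def cancel():
    if proc is not None and proc.returncode is None:
        proc.send_signal(signal.SIGUSR1)
    return {"cancelled": True}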

Performance

Tested on M1 Max (32-core GPU, 64 GB unified memory):

Audio      Time     Speed           Quality setting
7s clip    ~18s     0.4x realtime   Balanced
60s clip   ~150s    0.4x realtime   Balanced
60s clip   ~250s    0.2x realtime   Most accurate

First run includes ~25s to load the model. Subsequent transcriptions start immediately.

This is a 4-billion-parameter model running locally. For faster local transcription with smaller models, see whisper.cpp.

Project Structure

server.py           FastAPI server, process management, SSE streaming
static/index.html   Single-file frontend (vanilla HTML/CSS/JS)
voxtral.c/          antirez's voxtral.c (pure C, Metal GPU)

Requirements

  • Apple Silicon Mac (M1, M2, M3, or M4)
  • ~10 GB disk for model weights
  • ffmpeg (brew install ffmpeg)
  • Python 3.10+ (uv sync or pip install fastapi uvicorn python-multipart)

Credits

voxtral.c by antirez: pure C Voxtral inference with Metal GPU support.
