An interactive, high-fidelity browser-side simulator that estimates memory allocation, context scaling, and generation throughput (tokens/s) for any given local LLM quantization under specific hardware constraints.
The entire system is powered by dynamic, live external APIs to build the hardware database, model metadata registries, and task-performance quality indexes in real-time.
This diagram illustrates the full data pipeline, from browser-side simulation through live Hugging Face, Open LLM Leaderboard, and GPU database APIs, down to the physics engine that computes VRAM allocation and tokens/s:
All benchmark data is fetched nightly by the GitHub Actions workflow and cached in data/cache.json. Every value in the app shows where it came from. The cache payload includes a sources block with per-source status, URL, row count, and last-updated timestamp.
| Source | URL | Status | Refreshes |
|---|---|---|---|
| GGUF Model Catalog | huggingface.co/api/models?filter=gguf | ✅ Live | Daily 00:00 UTC |
| Open LLM Leaderboard v2 | datasets-server.huggingface.co/rows (dataset: open-llm-leaderboard/contents) | ✅ Live | Daily 00:00 UTC |
| BenchLM.ai | benchlm.ai/api/data/leaderboard | ✅ Live | Daily 00:00 UTC |
| LiveBench | livebench.ai/table_YYYY_MM_DD.csv | ✅ Live | Daily 00:00 UTC |
| EvalPlus | evalplus.github.io/results.json | ✅ Live | Daily 00:00 UTC |
| Quantization Constants | GGUF specification + llama.cpp community data | ✅ Fixed | On change |
| GPU Database (TechPowerUp) | RightNow-AI GitHub | ✅ Live | Daily 00:00 UTC |
| Speed Estimation Engine (per-quant scaling) | llama.cpp community benchmarks + GGUF spec | ⚡ Real-time | On every search |
| Quantization Loss (fallback) | PPL community averages (llama.cpp) | 📊 Fallback | On every search |
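As a rough illustration of how the `sources` block in `data/cache.json` could be inspected, the sketch below flags sources that failed or look outdated. The key names (`status`, `last_updated`), the `"ok"` status value, and the 36-hour threshold are assumptions made for this example, not the project's confirmed schema, and the sample values are made up.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(cache: dict, now: datetime, max_age_hours: float = 36.0) -> list[str]:
    """Return the names of sources that are not OK or were refreshed too long ago."""
    stale = []
    for name, meta in cache.get("sources", {}).items():
        # "status" and "last_updated" are assumed key names (hypothetical schema).
        updated = datetime.fromisoformat(meta["last_updated"])
        too_old = now - updated > timedelta(hours=max_age_hours)
        if meta.get("status") != "ok" or too_old:
            stale.append(name)
    return stale


# Made-up sample payload that mirrors the per-source fields described above.
sample_cache = {
    "sources": {
        "gguf_catalog": {
            "status": "ok",
            "url": "huggingface.co/api/models?filter=gguf",
            "rows": 120,
            "last_updated": "2024-01-02T00:00:00+00:00",
        },
        "livebench": {
            "status": "error",
            "url": "livebench.ai/table_YYYY_MM_DD.csv",
            "rows": 0,
            "last_updated": "2024-01-01T00:00:00+00:00",
        },
    }
}

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(stale_sources(sample_cache, now))
```

In practice the dictionary would come from `json.load` on `data/cache.json` rather than an inline sample.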
- Endpoint: `https://huggingface.co/api/models`
- HTTP Method: `GET`
- Query Parameters:
  - `search`: Keyword string (e.g. `gemma-2-2b`)
  - `filter`: Filter by tag (`gguf` used to isolate local models)
  - `sort`: Metric sorting (`downloads` used to rank popularity)
  - `direction`: Order (`-1` for descending)
  - `limit`: Max matching records (`15`)
  - `full`: Boolean (`true` returns parameter count, tags, and file trees)
- Usage: Powers autocomplete model matching and download-count lookups in the Live Model Hub sidebar search.
- Sample Fetch:
curl "https://huggingface.co/api/models?search=gemma-2-2b&filter=gguf&sort=downloads&direction=-1&limit=1&full=true"
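The same query string can be assembled programmatically. The following is a minimal sketch using only the standard library; the function name `build_search_url` is hypothetical and is not taken from the project's code.

```python
from urllib.parse import urlencode

HF_MODELS_API = "https://huggingface.co/api/models"


def build_search_url(keyword: str, limit: int = 15) -> str:
    """Build the GGUF search URL from the query parameters listed above."""
    params = {
        "search": keyword,
        "filter": "gguf",      # restrict results to GGUF-tagged models
        "sort": "downloads",   # rank by popularity
        "direction": -1,       # descending order
        "limit": limit,
        "full": "true",        # include parameter count, tags, and file trees
    }
    return f"{HF_MODELS_API}?{urlencode(params)}"


print(build_search_url("gemma-2-2b", limit=1))
```

With `limit=1` this reproduces the URL used in the sample fetch.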
- Endpoint: `https://huggingface.co/api/models/{model_id}`
- HTTP Method: `GET`
- Usage: Triggered when loading a specific model to extract download counts, file size lists, and safetensors weights dimensions.
- Sample Fetch:
curl "https://huggingface.co/api/models/google/gemma-2-2b-it"
- Endpoint: `https://huggingface.co/{model_id}/resolve/{branch}/config.json`
- HTTP Method: `GET`
- Branch Fallback: The system tries the `main` branch first; if it is missing, it falls back to the `master` branch.
- CORS Strategy: The `/resolve/` endpoint is hosted on Hugging Face's global CDN and includes the `Access-Control-Allow-Origin: *` header, so the browser can fetch it cross-origin.
- Usage: Downloads the model's structural specifications to extract parameters used in the physics equations:
  - `num_hidden_layers` / `num_layers`: Layer count (determines VRAM overhead and KV layers)
  - `hidden_size`: Dimension scaling (calculates attention matrix parameters)
  - `num_attention_heads` / `num_key_value_heads`: Handles GQA (Grouped-Query Attention) ratios used in KV cache VRAM estimates
  - `vocab_size` / `intermediate_size`: Estimates MLP parameters
- Sample Fetch:
curl -L "https://huggingface.co/google/gemma-2-2b-it/resolve/main/config.json"
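The branch fallback and field extraction described above could look roughly like the sketch below. The helper names (`fetch_json`, `load_config`, `extract_arch`) are hypothetical, and defaulting `num_key_value_heads` to `num_attention_heads` when the key is absent is an assumption about how non-GQA models would be handled. The demo at the end uses a stub fetcher so it runs without network access.

```python
import json
import urllib.error
import urllib.request
from typing import Callable


def fetch_json(url: str) -> dict:
    """Fetch a URL and parse the body as JSON (redirects are followed by default)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def load_config(model_id: str, fetch: Callable[[str], dict] = fetch_json) -> dict:
    """Try the main branch first, then fall back to master."""
    last_error = None
    for branch in ("main", "master"):
        url = f"https://huggingface.co/{model_id}/resolve/{branch}/config.json"
        try:
            return fetch(url)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            last_error = exc
    raise RuntimeError(f"no config.json found for {model_id}") from last_error


def extract_arch(config: dict) -> dict:
    """Pick out the fields used by the memory formulas, with key fallbacks."""
    heads = config["num_attention_heads"]
    return {
        "layers": config.get("num_hidden_layers", config.get("num_layers")),
        "hidden_size": config["hidden_size"],
        "attn_heads": heads,
        # Assumption: without GQA, the KV head count equals the attention head count.
        "kv_heads": config.get("num_key_value_heads", heads),
        "vocab_size": config.get("vocab_size"),
        "intermediate_size": config.get("intermediate_size"),
    }


def fake_fetch(url: str) -> dict:
    """Stub fetcher: pretends the main branch is missing (made-up values)."""
    if "/resolve/main/" in url:
        raise urllib.error.URLError("not found")
    return {"num_layers": 4, "hidden_size": 64, "num_attention_heads": 8}


arch = extract_arch(load_config("example/model", fetch=fake_fetch))
print(arch["layers"], arch["kv_heads"])
```

Passing the fetcher as a parameter keeps the fallback logic testable offline; calling `load_config("google/gemma-2-2b-it")` without the stub would perform the real request.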
- Endpoint: `https://datasets-server.huggingface.co/rows`
- HTTP Method: `GET`
- Query Parameters:
  - `dataset`: Name of targeted leaderboard contents (`open-llm-leaderboard/contents`)
  - `config`: Config profile (`default`)
  - `split`: Targeted split (`train`)
  - `limit`: Pagination row counts (`100`)
- Usage: Scrapes the scientific task-performance benchmark scores of the top 100 open-source models on startup, linking verified averages directly to imported GGUF variants.
- Sample Fetch:
curl "https://datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard%2Fcontents&config=default&split=train&limit=100"
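Once the rows are fetched, they need to be reduced to a model-to-score mapping before they can be linked to GGUF variants. The sketch below assumes the response wraps each record as `{"row_idx": ..., "row": {...}}` inside a top-level `rows` list, which is how the dataset viewer API is generally documented. The column names `model` and `average` and the sample values are placeholders, not the leaderboard's real column names or scores.

```python
def rows_to_scores(payload: dict, name_col: str, score_col: str) -> dict[str, float]:
    """Map model name -> score, skipping rows with missing or non-numeric values."""
    scores = {}
    for item in payload.get("rows", []):
        row = item.get("row", {})
        name, score = row.get(name_col), row.get(score_col)
        if name is not None and isinstance(score, (int, float)):
            scores[name] = float(score)
    return scores


# Made-up payload with placeholder column names and values.
sample_payload = {
    "rows": [
        {"row_idx": 0, "row": {"model": "org/model-a", "average": 41.5}},
        {"row_idx": 1, "row": {"model": "org/model-b", "average": None}},
        {"row_idx": 2, "row": {"model": "org/model-c", "average": 38}},
    ]
}

print(rows_to_scores(sample_payload, name_col="model", score_col="average"))
```

Rows without a usable score are dropped rather than stored as zero, so a missing benchmark is not mistaken for a poor result.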
- Endpoints:
  - NVIDIA: `https://raw.githubusercontent.com/RightNow-AI/RightNow-GPU-Database/main/data/nvidia/all.json`
  - AMD: `https://raw.githubusercontent.com/RightNow-AI/RightNow-GPU-Database/main/data/amd/all.json`
  - Intel: `https://raw.githubusercontent.com/RightNow-AI/RightNow-GPU-Database/main/data/intel/all.json`
- HTTP Method: `GET`
- Usage: A comprehensive open-source database containing technical specifications (transistor count, VRAM in GB, memory bandwidth in GB/s, bus width, TDP, and release dates) for more than 2,750 hardware entries scraped from TechPowerUp.
- Filter/Compilation: The build script (`refresh_cache.py`) filters these entries into a local cache of 812 modern GPUs with dedicated VRAM >= 4.0 GB, so frontend lookups and searches are fast and work offline.
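The VRAM filter can be pictured as a single pass over the merged vendor lists. This is only a sketch: the field names `name` and `vram_gb` are assumptions about the entry shape, the sample entries are placeholders rather than real specifications, and the actual logic in `refresh_cache.py` may apply further criteria (for example a release-date cutoff for "modern" GPUs).

```python
def filter_gpus(entries: list[dict], min_vram_gb: float = 4.0) -> list[dict]:
    """Keep entries with a numeric VRAM value of at least min_vram_gb, sorted by name."""
    kept = [
        gpu
        for gpu in entries
        # "vram_gb" is an assumed field name; entries without it are dropped.
        if isinstance(gpu.get("vram_gb"), (int, float)) and gpu["vram_gb"] >= min_vram_gb
    ]
    return sorted(kept, key=lambda gpu: gpu["name"])


# Placeholder entries, not real hardware data.
sample_entries = [
    {"name": "GPU C", "vram_gb": 8.0, "bandwidth_gbps": 400.0},
    {"name": "GPU A", "vram_gb": 2.0, "bandwidth_gbps": 100.0},
    {"name": "GPU B", "vram_gb": 4.0, "bandwidth_gbps": 200.0},
    {"name": "GPU D", "vram_gb": None},
]

print([gpu["name"] for gpu in filter_gpus(sample_entries)])
```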
Browsers apply security protocols (CORS) that block fetching relative files (like data/cache.json) when opened via the file protocol (file:///).
To host the application locally under secure origins:
- Open your terminal in the directory.
- Run a simple Python server:
python -m http.server 8000
- Open your browser and go to: http://localhost:8000/
- The top right indicator will light up green: Database: Synced.
Because ModelSelector is 100% static (composed exclusively of index.html and the pre-computed database data/cache.json), it can be hosted for free on any static provider without requiring a backend database or node runner.
https://harrytyp.github.io/modelselector/
- GitHub Pages: Already configured. Pushes to `main` automatically trigger a Pages deployment. The nightly cache refresh also triggers a redeploy so the live site stays current.
- Vercel / Netlify:
  - Connect your repository to Vercel or Netlify.
  - Leave the build command blank and publish the root directory.
A GitHub Actions workflow runs every night at 00:00 UTC to refresh the model catalog and benchmark scores:
- Fetches the latest GGUF models from Hugging Face Hub (paginated, with rate-limit handling)
- Paginates the full Open LLM Leaderboard v2 dataset (4,500+ entries)
- Queries BenchLM.ai for multi-dimensional rankings
- Fetches LiveBench scores from the latest CSV release
- Fetches EvalPlus coding benchmark scores from their live JSON
- Rebuilds the GPU catalog from the TechPowerUp database
- Stores quantization constants, quality loss factors, and per-quant speed scaling in cache
- Commits the updated `data/cache.json` and triggers a Pages redeploy
See .github/workflows/refresh_cache.yml for the full workflow definition.
Memory and speed allocations are compiled using verified GQA-aware formulas:
- Quantized Model Size (Weights VRAM), in GB, where Parameters is the raw parameter count (e.g. $7 \times 10^9$ for a 7B model):
  $$\text{VRAM}_{\text{Weights}} = \frac{\text{Parameters} \times \text{BytesPerWeight (Quant)}}{10^9}$$
- GQA-Aware KV Cache Overhead, in GB (the leading factor of 2 covers keys and values; the trailing 2 is bytes per fp16 value):
  $$\text{VRAM}_{\text{KV Cache}} = \frac{2 \times \text{layers} \times \text{context\_length} \times \text{hidden\_size} \times \left( \frac{\text{num\_kv\_heads}}{\text{num\_attn\_heads}} \right) \times 2}{10^9}$$
- Throughput Compilation (Tokens/s):
  - If VRAM fits entirely within the GPU:
    $$\text{TPS} = \frac{\text{Memory Bandwidth (GB/s)}}{\text{Model Weights Size (GB)}} \times \text{Efficiency (0.35 for GPU)} \times \text{Quant Scaling Factor}$$
  - If memory limits require offloading to system RAM:
    $$\text{TPS} = \frac{\text{System Bus Bandwidth (GB/s)}}{\text{Model Weights Size (GB)}} \times \text{Efficiency (0.22 for CPU)} \times \text{Quant Scaling Factor}$$
  - Per-quant speed scaling: the quantization type affects throughput. Q2_K (1.10x), Q4_K_M (1.00x baseline), Q8_0 (0.78x), fp16 (0.65x). These factors are stored in the nightly cache and override JS fallbacks.
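The formulas above translate fairly directly into code. The sketch below is one possible reading, not the project's implementation: the function names are hypothetical, the parameter count is passed as a raw count so that dividing by 10^9 yields GB, and the example inputs (an 8B-parameter model at an assumed 0.5625 bytes per weight, 360 GB/s of bandwidth) are illustrative values only.

```python
# Per-quant speed factors as listed above (Q4_K_M is the baseline).
QUANT_SPEED = {"Q2_K": 1.10, "Q4_K_M": 1.00, "Q8_0": 0.78, "fp16": 0.65}


def weights_vram_gb(params: float, bytes_per_weight: float) -> float:
    """Weights size in GB; params is the raw parameter count (e.g. 8e9)."""
    return params * bytes_per_weight / 1e9


def kv_cache_vram_gb(layers: int, context_length: int, hidden_size: int,
                     kv_heads: int, attn_heads: int, bytes_per_value: int = 2) -> float:
    """GQA-aware KV cache size in GB; the leading 2 accounts for keys and values."""
    gqa_ratio = kv_heads / attn_heads
    return 2 * layers * context_length * hidden_size * gqa_ratio * bytes_per_value / 1e9


def tokens_per_second(bandwidth_gbps: float, weights_gb: float,
                      quant: str, on_gpu: bool = True) -> float:
    """Bandwidth-bound throughput estimate with the efficiency factors from the text."""
    efficiency = 0.35 if on_gpu else 0.22
    return bandwidth_gbps / weights_gb * efficiency * QUANT_SPEED[quant]


# Illustrative inputs only (assumed values, not measurements).
weights = weights_vram_gb(params=8e9, bytes_per_weight=0.5625)
kv = kv_cache_vram_gb(layers=32, context_length=4096, hidden_size=4096,
                      kv_heads=8, attn_heads=32)
tps = tokens_per_second(bandwidth_gbps=360.0, weights_gb=weights, quant="Q4_K_M")

print(f"weights: {weights:.2f} GB, KV cache: {kv:.2f} GB, speed: {tps:.1f} tok/s")
```

With 8 KV heads against 32 attention heads, the cache is a quarter of what full multi-head attention would need at the same context length, which is why the head ratio appears in the formula.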
Following targeted refactoring to keep the simulator lightning-fast and highly focused, the system implements a direct, ultra-transparent developer interface:
- Snappy Cache-First Databases: Startup loads the default model/GPU catalog instantly from the fast local JSON cache (`data/cache.json`), preventing Hugging Face API rate limits or startup fetch lag.
- Instant WebGL Hardware Scan: Click the Scan My GPU (Auto-Detect) button next to target devices to automatically identify your active graphics card in 1-click using the browser's WebGL debug extensions.
- Completely Transparent Options List: Dynamic physics simulations are mapped in real-time across all available models, displayed directly inside a dense, sortable table. All complex advisor grid cards have been removed to prioritize raw table transparency.
- Rich Model Cards: Click any row to reveal detailed quantization specifications, memory scaling graphs, and direct links to Hugging Face (`https://huggingface.co/{model_id}`) and the Open LLM Leaderboard.