ModelSelector - LLM Optimizer & Hardware Simulator

Live App: https://harrytyp.github.io/modelselector/

An interactive, high-fidelity browser-side simulator that estimates memory allocation, context scaling, and generation throughput (tokens/s) for any given local LLM quantization under specific hardware constraints.

The hardware database, model metadata registry, and task-performance quality indexes are all built from live external APIs: most data is refreshed nightly into a cache, and model searches query Hugging Face directly.

System Architecture & Data Flow

This diagram illustrates the full data pipeline, from browser-side simulation through live Hugging Face, Open LLM Leaderboard, and GPU database APIs, down to the physics engine that computes VRAM allocation and tokens/s:

Dynamic API Endpoints Directory

0. Data Sources & Live Status

All benchmark data is fetched nightly by the GitHub Actions workflow and cached in data/cache.json. Every value in the app shows where it came from. The cache payload includes a sources block with per-source status, URL, row count, and last-updated timestamp.

| Source | URL / Origin | Status | Refreshes |
| --- | --- | --- | --- |
| GGUF Model Catalog | `huggingface.co/api/models?filter=gguf` | ✅ Live | Daily 00:00 UTC |
| Open LLM Leaderboard v2 | `datasets-server.huggingface.co/rows` (dataset: open-llm-leaderboard/contents) | ✅ Live | Daily 00:00 UTC |
| BenchLM.ai | `benchlm.ai/api/data/leaderboard` | ✅ Live | Daily 00:00 UTC |
| LiveBench | `livebench.ai/table_YYYY_MM_DD.csv` | ✅ Live | Daily 00:00 UTC |
| EvalPlus | `evalplus.github.io/results.json` | ✅ Live | Daily 00:00 UTC |
| Quantization Constants | GGUF specification + llama.cpp community data | ✅ Fixed | On change |
| GPU Database (TechPowerUp) | RightNow-AI GitHub | ✅ Live | Daily 00:00 UTC |
| Speed Estimation Engine (per-quant scaling) | llama.cpp community benchmarks + GGUF spec | ⚡ Real-time | On every search |
| Quantization Loss (fallback) | PPL community averages (llama.cpp) | 📊 Fallback | On every search |
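
As a rough illustration of how the sources block in data/cache.json could be inspected, the sketch below prints one summary line per source. The key names (`sources`, `status`, `url`, `rows`, `updated`) and the sample payload are assumptions made for this example; the real cache schema may differ.

```python
from typing import Any


def summarize_sources(cache: dict[str, Any]) -> list[str]:
    """Return one human-readable line per entry in the cache's sources block.

    The key names used here ("sources", "status", "rows", "updated") are
    hypothetical; adjust them to the real schema of data/cache.json.
    """
    lines = []
    for name, info in sorted(cache.get("sources", {}).items()):
        lines.append(
            f"{name}: {info.get('status', 'unknown')}, "
            f"{info.get('rows', 0)} rows, updated {info.get('updated', 'n/a')}"
        )
    return lines


# Illustrative payload only; these are not real cache contents.
example_cache = {
    "sources": {
        "evalplus": {
            "status": "live",
            "url": "evalplus.github.io/results.json",
            "rows": 120,
            "updated": "2025-01-01T00:00:00Z",
        }
    }
}
summary = summarize_sources(example_cache)
```

In practice the dictionary would come from `json.load` on the cache file rather than from an inline literal.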

1. Hugging Face Models API (Live Keyword Search)

Endpoint: https://huggingface.co/api/models
HTTP Method: GET
Query Parameters:
- search: Keyword string (e.g. gemma-2-2b)
- filter: Filter by tag (gguf used to isolate local models)
- sort: Metric sorting (downloads used to rank popularity)
- direction: Order (-1 for descending)
- limit: Max matching records (15)
- full: Boolean (true returns parameter count, tags, and file trees)
Usage: Powers autocomplete model matching and download-count lookups in the Live Model Hub sidebar search.

Sample Fetch:

curl "https://huggingface.co/api/models?search=gemma-2-2b&filter=gguf&sort=downloads&direction=-1&limit=1&full=true"
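
The same request can be assembled programmatically. This is a minimal Python sketch that only builds the URL from the query parameters listed above; the function name `build_search_url` is made up for this example and is not part of the app.

```python
from urllib.parse import urlencode

HF_MODELS_API = "https://huggingface.co/api/models"


def build_search_url(keyword: str, limit: int = 15) -> str:
    """Build the GGUF keyword-search URL from the documented query parameters."""
    params = {
        "search": keyword,
        "filter": "gguf",       # isolate local (GGUF) models
        "sort": "downloads",    # rank by popularity
        "direction": -1,        # descending order
        "limit": limit,
        "full": "true",         # include parameter count, tags, and file tree
    }
    return f"{HF_MODELS_API}?{urlencode(params)}"


url = build_search_url("gemma-2-2b", limit=1)
```

With `limit=1` the result is the same URL as the curl sample; leaving `limit` at its default requests 15 matches.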

2. Hugging Face Single Model Metadata API

Endpoint: https://huggingface.co/api/models/{model_id}
HTTP Method: GET
Usage: Triggered when loading a specific model to extract download counts, file size lists, and safetensors weights dimensions.

Sample Fetch:

curl "https://huggingface.co/api/models/google/gemma-2-2b-it"

3. Hugging Face Resolve API (Transformer config.json)

Endpoint: https://huggingface.co/{model_id}/resolve/{branch}/config.json
HTTP Method: GET
Branch Fallback: System attempts main branch first; if missing, falls back to master branch.
CORS Strategy: The /resolve/ endpoint is served from Hugging Face's CDN with the Access-Control-Allow-Origin: * header, so the browser can fetch it cross-origin without a proxy.
Usage: Downloads the model's structural specification to extract the parameters used in the physics equations:
- num_hidden_layers / num_layers: Layer count (scales the KV cache and per-layer VRAM overhead)
- hidden_size: Model width (sets the size of the attention matrices)
- num_attention_heads / num_key_value_heads: GQA (Grouped-Query Attention) ratio used in KV cache VRAM estimates
- vocab_size / intermediate_size: Used to estimate embedding and MLP parameter counts

Sample Fetch:

curl -L "https://huggingface.co/google/gemma-2-2b-it/resolve/main/config.json"
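
The branch fallback and field extraction described above might look roughly like the following Python sketch. The helper names (`http_get`, `load_config`, `extract_dims`) are hypothetical, and defaulting `num_key_value_heads` to `num_attention_heads` when the key is absent is an assumption about how a non-GQA config should be read, not confirmed behaviour of the app.

```python
import json
from typing import Callable, Optional
from urllib.error import HTTPError
from urllib.request import urlopen


def http_get(url: str) -> Optional[str]:
    """Return the response body, or None when the file is missing (HTTP 404)."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8")
    except HTTPError as err:
        if err.code == 404:
            return None
        raise


def load_config(model_id: str,
                fetch: Callable[[str], Optional[str]] = http_get) -> dict:
    """Try the main branch first, then fall back to master."""
    for branch in ("main", "master"):
        body = fetch(f"https://huggingface.co/{model_id}/resolve/{branch}/config.json")
        if body is not None:
            return json.loads(body)
    raise FileNotFoundError(f"no config.json found for {model_id}")


def extract_dims(config: dict) -> dict:
    """Pull the fields used by the memory formulas, with the listed fallbacks."""
    heads = config["num_attention_heads"]
    return {
        "layers": config.get("num_hidden_layers", config.get("num_layers")),
        "hidden_size": config["hidden_size"],
        "attn_heads": heads,
        # Assumption: without num_key_value_heads, treat it as plain multi-head attention.
        "kv_heads": config.get("num_key_value_heads", heads),
    }
```

Passing the fetcher as a parameter keeps the fallback logic testable without network access.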

4. Hugging Face Open LLM Leaderboard API

Endpoint: https://datasets-server.huggingface.co/rows
HTTP Method: GET
Query Parameters:
- dataset: Name of targeted leaderboard contents (open-llm-leaderboard/contents)
- config: Config profile (default)
- split: Targeted split (train)
- limit: Pagination row counts (100)
Usage: Fetches task-performance benchmark scores for up to 100 leaderboard rows on startup and links the verified averages to imported GGUF variants.

Sample Fetch:

curl "https://datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard%2Fcontents&config=default&split=train&limit=100"
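
Once the response arrives, the scores have to be pulled out of the row wrappers. The sketch below assumes the dataset viewer's usual response shape, where each element of `rows` holds the record under a `row` key. The column names `fullname` and `Average ⬆️` are assumptions about the leaderboard dataset and are exposed as parameters so they can be changed.

```python
def extract_scores(payload: dict,
                   name_col: str = "fullname",
                   score_col: str = "Average ⬆️") -> dict[str, float]:
    """Map model name to average score, skipping rows with missing values.

    name_col and score_col are assumed column names; verify them against
    the actual dataset before relying on this.
    """
    scores = {}
    for item in payload.get("rows", []):
        record = item.get("row", {})
        name = record.get(name_col)
        score = record.get(score_col)
        if name is not None and score is not None:
            scores[name] = float(score)
    return scores


# Synthetic payload for illustration; not real leaderboard data.
sample_payload = {
    "rows": [
        {"row_idx": 0, "row": {"fullname": "org/model-a", "Average ⬆️": 31.5}},
        {"row_idx": 1, "row": {"fullname": "org/model-b", "Average ⬆️": None}},
    ]
}
scores = extract_scores(sample_payload)
```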

5. TechPowerUp GPU Database (RightNow-AI Dataset)

Endpoints:
- NVIDIA: https://raw.githubusercontent.com/RightNow-AI/RightNow-GPU-Database/main/data/nvidia/all.json
- AMD: https://raw.githubusercontent.com/RightNow-AI/RightNow-GPU-Database/main/data/amd/all.json
- Intel: https://raw.githubusercontent.com/RightNow-AI/RightNow-GPU-Database/main/data/intel/all.json
HTTP Method: GET
Usage: An open-source database of technical specifications (transistor count, VRAM in GB, memory bandwidth in GB/s, bus width, TDP, and release dates) for more than 2,750 hardware entries scraped from TechPowerUp.
Filter/Compilation: The build script (refresh_cache.py) filters these entries into a local cache of 812 modern GPUs with dedicated VRAM >= 4.0 GB, which keeps frontend lookups and searches fast and available offline.
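
The VRAM filter can be pictured with a short Python sketch. It is not taken from refresh_cache.py: the field name `memory_gb` and the sample entries are assumptions for illustration, and the real script may apply further criteria (for example, what counts as "modern").

```python
def filter_gpus(entries: list[dict],
                min_vram_gb: float = 4.0,
                vram_key: str = "memory_gb") -> list[dict]:
    """Keep entries whose VRAM is a number at or above the threshold.

    vram_key is a hypothetical field name; entries with a missing or
    non-numeric value are dropped rather than guessed at.
    """
    return [
        gpu for gpu in entries
        if isinstance(gpu.get(vram_key), (int, float))
        and gpu[vram_key] >= min_vram_gb
    ]


# Made-up entries for illustration only.
sample_gpus = [
    {"name": "Card A", "memory_gb": 24},
    {"name": "Card B", "memory_gb": 2},
    {"name": "Card C", "memory_gb": 4.0},
    {"name": "Card D"},  # no VRAM field, e.g. an integrated part
]
kept = [gpu["name"] for gpu in filter_gpus(sample_gpus)]
```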

Running Locally & Handling CORS

Browsers block fetch requests for relative files (such as data/cache.json) when a page is opened via the file protocol (file:///), because file URLs do not satisfy the same-origin and CORS checks applied to fetch.

To serve the application locally over HTTP instead:

1. Open a terminal in the repository directory.
2. Run a simple Python server:
```
python -m http.server 8000
```
3. Open your browser and go to http://localhost:8000/

The indicator in the top right will light up green: Database: Synced.

Deployment & Automated Refresh

Because ModelSelector is 100% static (composed exclusively of index.html and the pre-computed database data/cache.json), it can be hosted for free on any static provider without a backend database or Node.js runtime.

Live Deployment

https://harrytyp.github.io/modelselector/

Hosting Options:

GitHub Pages: Already configured. Pushes to main automatically trigger a Pages deployment. The nightly cache refresh also triggers a redeploy so the live site stays current.
Vercel / Netlify:
1. Connect your repository to Vercel or Netlify.
2. Leave the build command blank and publish the root directory.

Automated Nightly Refresh (GitHub Actions):

A GitHub Actions workflow runs every night at 00:00 UTC to refresh the model catalog and benchmark scores:

- Fetches the latest GGUF models from Hugging Face Hub (paginated, with rate-limit handling)
- Paginates the full Open LLM Leaderboard v2 dataset (4,500+ entries)
- Queries BenchLM.ai for multi-dimensional rankings
- Fetches LiveBench scores from the latest CSV release
- Fetches EvalPlus coding benchmark scores from their live JSON
- Rebuilds the GPU catalog from the TechPowerUp database
- Stores quantization constants, quality loss factors, and per-quant speed scaling in the cache
- Commits the updated data/cache.json and triggers a Pages redeploy

See .github/workflows/refresh_cache.yml for the full workflow definition.

KV Cache & TPS Physics Simulator Formulations

Memory and throughput estimates are computed with the following GQA-aware formulas:

Quantized Model Size (Weights VRAM, in GB), with Parameters as the raw weight count (e.g. $7 \times 10^9$ for a 7B model):
$$\text{VRAM}_{\text{Weights}} = \frac{\text{Parameters} \times \text{BytesPerWeight (Quant)}}{10^9}$$
GQA-Aware KV Cache Overhead:
$$\text{VRAM}_{\text{KV Cache}} = \frac{2 \times \text{layers} \times \text{context\_length} \times \text{hidden\_size} \times \left( \frac{\text{num\_kv\_heads}}{\text{num\_attn\_heads}} \right) \times 2 \text{ [bytes/fp16]}}{10^9}$$
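
To make the two memory formulas concrete, here is a small Python sketch that evaluates them, taking the parameter count as a raw number of weights so that dividing by 10^9 yields GB. The function names and the example dimensions (an 8-billion-parameter model at 2 bytes per weight, 32 layers, hidden size 4096, 8 KV heads out of 32) are illustrative assumptions, not values from the app.

```python
def weights_vram_gb(param_count: float, bytes_per_weight: float) -> float:
    """Weights VRAM in GB: raw parameter count times bytes per weight, over 1e9."""
    return param_count * bytes_per_weight / 1e9


def kv_cache_vram_gb(layers: int, context_length: int, hidden_size: int,
                     num_kv_heads: int, num_attn_heads: int,
                     bytes_per_value: int = 2) -> float:
    """GQA-aware KV cache size in GB.

    The leading 2 accounts for keys and values; the head ratio shrinks the
    cache when several attention heads share one KV head.
    """
    gqa_ratio = num_kv_heads / num_attn_heads
    return (2 * layers * context_length * hidden_size
            * gqa_ratio * bytes_per_value) / 1e9


# Illustrative numbers only.
weights_gb = weights_vram_gb(8e9, 2.0)  # fp16 stores 2 bytes per weight
kv_gb = kv_cache_vram_gb(layers=32, context_length=8192, hidden_size=4096,
                         num_kv_heads=8, num_attn_heads=32)
print(f"weights: {weights_gb:.2f} GB, KV cache: {kv_gb:.2f} GB")
```

With these inputs the weights come to 16 GB and the KV cache to roughly 1.07 GB; setting `num_kv_heads` equal to `num_attn_heads` quadruples the cache, which is the effect the GQA ratio captures.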
Throughput Estimate (Tokens/s):
- If the model fits entirely within GPU VRAM: $$\text{TPS} = \frac{\text{Memory Bandwidth (GB/s)}}{\text{Model Weights Size (GB)}} \times \text{Efficiency (35\% for GPU)} \times \text{Quant Scaling Factor}$$
- If memory limits require offloading to system RAM: $$\text{TPS} = \frac{\text{System Bus Bandwidth (GB/s)}}{\text{Model Weights Size (GB)}} \times \text{Efficiency (22\% for CPU)} \times \text{Quant Scaling Factor}$$
- Per-quant speed scaling: the quantization type scales throughput by a fixed factor: Q2_K (1.10x), Q4_K_M (1.00x baseline), Q8_0 (0.78x), fp16 (0.65x). These factors are stored in the nightly cache and override the JS fallbacks.
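
The throughput rules above reduce to a few lines of arithmetic. The Python sketch below applies them with the efficiency values and per-quant factors quoted in this section; the function name `estimate_tps` and the bandwidth and model-size inputs are illustrative assumptions.

```python
# Per-quant scaling factors as listed above (Q4_K_M is the baseline).
QUANT_SPEED = {"Q2_K": 1.10, "Q4_K_M": 1.00, "Q8_0": 0.78, "fp16": 0.65}

GPU_EFFICIENCY = 0.35  # model fits entirely in VRAM
CPU_EFFICIENCY = 0.22  # model is offloaded to system RAM


def estimate_tps(bandwidth_gb_s: float, weights_gb: float,
                 quant: str, fits_in_vram: bool) -> float:
    """Estimate tokens/s as bandwidth over model size, scaled by efficiency and quant factor.

    Pass GPU memory bandwidth when the model fits in VRAM, and system bus
    bandwidth when it is offloaded.
    """
    efficiency = GPU_EFFICIENCY if fits_in_vram else CPU_EFFICIENCY
    return bandwidth_gb_s / weights_gb * efficiency * QUANT_SPEED[quant]


# Illustrative inputs: a 5 GB Q4_K_M model on 1000 GB/s VRAM vs. a 50 GB/s system bus.
gpu_tps = estimate_tps(1000.0, 5.0, "Q4_K_M", fits_in_vram=True)
cpu_tps = estimate_tps(50.0, 5.0, "Q4_K_M", fits_in_vram=False)
print(f"GPU: {gpu_tps:.1f} tok/s, offloaded: {cpu_tps:.1f} tok/s")
```

For these inputs the estimate is about 70 tokens/s on the GPU and about 2.2 tokens/s when offloaded, which shows how strongly the result depends on which bandwidth figure applies.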

Simplified User Flow & WebGL Scanner

After a refactoring pass aimed at speed and focus, the interface is reduced to four pieces:

- Cache-First Databases: Startup loads the default model/GPU catalog from the local JSON cache (data/cache.json), avoiding Hugging Face API rate limits and startup fetch lag.
- WebGL Hardware Scan: Click the Scan My GPU (Auto-Detect) button next to the target device selector to identify your active graphics card in one click using the browser's WebGL debug extensions.
- Transparent Options List: Physics simulations run in real time across all available models and are displayed in a dense, sortable table. The earlier advisor grid cards have been removed in favor of the raw table.
- Rich Model Cards: Click any row to reveal detailed quantization specifications, memory scaling graphs, and direct links to Hugging Face (https://huggingface.co/{model_id}) and the Open LLM Leaderboard.
