Bench

Audit truth, not just rows.

Bench is where Silicon Score stays honest. It shows what is measured, what is still modeled, how runtimes are classified, which methodologies are comparable, and what the next research pass should attack.

Catalog state

Benchmark rows

250

Open issues

2

Research queue

8

Workflow comparisons

4

Model scorecards

10

Audit now

One operator board for truth debt.

This board merges unresolved quality issues, frontier heat without enough evidence, and the active operator queue. It is the shortest path from “what is weak?” to “what should we do next?”

Kind · Priority · Title · Action · Why
Frontier hotspot · priority 237

Qwen3.5-27B

Qwen3.5-27B appears across 4 lenses and 5 budget slices, and has 11 Apple Silicon benchmark rows, 1 official model brief captured, 4 fetched artifacts, and 8 curated practitioner signals (8 Apple Silicon-specific). Field evidence: 8 Apple Silicon field reports; best reported generation ~31.6 tok/s; seen on MacBook Pro M5 MAX 128GB, M2 ULTRA 128GB, and MacBook Pro M4 PRO; via MLX and llama.cpp.

Action: capture practitioner runtime notes. Why: this model matters, but first-party or broader cross-chip evidence is still thin.
Frontier hotspot · priority 206

Qwen3.5-122B-A10B

Qwen3.5-122B-A10B appears across 4 lenses and 3 budget slices, and has 6 Apple Silicon benchmark rows, 1 official model brief captured, 5 fetched artifacts (1 blocked or partial), and 6 curated practitioner signals (6 Apple Silicon-specific). Field evidence: 4 Apple Silicon field reports; best reported generation ~65.9 tok/s; best reported prompt processing ~500 tok/s; seen on MacBook Pro M5 MAX 128GB, Mac Studio M3 ULTRA 256GB, and MacBook Pro M5 PRO 64GB; via MLX and llama.cpp.

Action: capture practitioner runtime notes. Why: this model matters, but first-party or broader cross-chip evidence is still thin.
Frontier hotspot · priority 205

Devstral Small 2 24B

Devstral Small 2 24B appears across 3 lenses and 5 budget slices, and has 6 Apple Silicon benchmark rows, 1 official model brief captured, 5 fetched artifacts, and 4 curated practitioner signals (5 Apple Silicon-specific). Field evidence: 5 Apple Silicon field reports; best reported generation ~47 tok/s; seen on Mac Studio M3 ULTRA 256GB and M1 ULTRA 64GB; via MLX.

Action: capture practitioner runtime notes. Why: this model matters, but first-party or broader cross-chip evidence is still thin.
Frontier hotspot · priority 195

Qwen3.5-397B-A17B

Qwen3.5-397B-A17B appears across 4 lenses and 1 budget slice, and has 2 Apple Silicon benchmark rows, 1 official model brief captured, 3 fetched artifacts, and 4 curated practitioner signals (5 Apple Silicon-specific). Field evidence: 5 Apple Silicon field reports; best reported generation ~40 tok/s; seen on Mac Studio M3 ULTRA 512GB and MacBook Pro M5 MAX 128GB; via MLX and flash-moe.

Action: capture practitioner runtime notes. Why: this model matters, but first-party or broader cross-chip evidence is still thin.
Frontier hotspot · priority 182

Qwen3.5-35B-A3B

Qwen3.5-35B-A3B appears across 3 lenses and 4 budget slices, and has 7 Apple Silicon benchmark rows, 1 official model brief captured, 7 fetched artifacts, and 8 curated practitioner signals (11 Apple Silicon-specific). Field evidence: 11 Apple Silicon field reports; best reported generation ~89.4 tok/s; seen on MacBook Pro M1 MAX 64GB, Mac Mini M4 PRO 64GB, and unspecified Apple Silicon; via llama.cpp, MLX, and Ollama.

Action: capture practitioner runtime notes. Why: this model matters, but first-party or broader cross-chip evidence is still thin.
Frontier hotspot · priority 180

Qwen 3 32B

Qwen 3 32B appears across 4 lenses and 5 budget slices, and has 2 Apple Silicon benchmark rows, 1 official model brief captured, 3 fetched artifacts, and 2 curated practitioner signals (2 Apple Silicon-specific). Field evidence: 1 Apple Silicon field report; best reported generation ~20 tok/s; seen on MacBook Pro M4 MAX 128GB; via MLX.

Action: capture practitioner runtime notes. Why: this model matters, but first-party or broader cross-chip evidence is still thin.
Frontier hotspot · priority 175

GLM-4.5-Air

GLM-4.5-Air appears across 4 lenses and 2 budget slices, and has 1 Apple Silicon benchmark row, 1 official model brief captured, 4 fetched artifacts, and 3 curated practitioner signals (2 Apple Silicon-specific). Field evidence: 1 Apple Silicon field report; best reported generation ~54 tok/s; seen on Mac Studio M3 ULTRA 256GB; via MLX.

Action: capture practitioner runtime notes. Why: this model matters, but first-party or broader cross-chip evidence is still thin.
Frontier hotspot · priority 171

Llama 3.3 70B

Llama 3.3 70B appears across 4 lenses and 4 budget slices, and has 2 Apple Silicon benchmark rows, 1 official locator ready for capture, 2 fetched artifacts, and 1 curated practitioner signal (1 Apple Silicon-specific). Field evidence: 1 Apple Silicon field report; best reported generation ~11.8 tok/s; seen on MacBook Pro M4 MAX 128GB; via MLX.

Action: capture practitioner runtime notes. Why: this model matters, but first-party or broader cross-chip evidence is still thin.

Evidence pressure

Coverage is still the main weakness.

Metadata debt is now small. The bigger issue is still how much of the catalog depends on community rows and thin runtime coverage.

Macs

28

Models

33

Source captures

89

Evidence class · Rows · Share
community unverified · 221 · 88%
trusted reference · 28 · 11%
factory measured · 1 · 0%
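The share column is each class's integer-rounded percentage of the 250-row catalog; a minimal sketch of that arithmetic, with the counts taken from the table above:

```python
# Evidence-class row counts from the Bench catalog table.
rows = {"community unverified": 221, "trusted reference": 28, "factory measured": 1}

total = sum(rows.values())  # 250, matching the "Benchmark rows" stat card

# Integer-rounded share of the catalog per evidence class.
shares = {cls: round(100 * count / total) for cls, count in rows.items()}
# community unverified -> 88, trusted reference -> 11, factory measured -> 0
```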

Unresolved issues

Small list, but still real debt.

Repair m1-pro-16-core-gpu--llama-2-7b--q4-0--llamacpp

Machine RAM is still unknown because the linked source identifies the chip variant but not the exact memory tier.

recover machine ram tier

Repair m3-pro-18-core-gpu--llama-2-7b--q4-0--llamacpp

Machine RAM is still unknown because the linked source identifies the chip variant but not the exact memory tier.

recover machine ram tier

Runtime taxonomy

Teach runtime as a facet, not a mini-site.

Public runtime locks should stay simple. Backends and wrappers still matter, but they belong here as audit semantics rather than top-level navigation.

Runtime · Layer · Public lock · Rows · Why it exists

Llamafile

Llamafile wrapper on llama.cpp

Wrapper

llama.cpp stack

Public filter: llama.cpp stack.

167 rows. Mozilla's portable llama.cpp packaging and CLI. LocalScore runs on top of this stack for submitted benchmark results. Latest version verified from official GitHub releases on 2026-03-25.

MLX

MLX backend

Backend

MLX

Public filter: MLX.

44 rows. Apple's ML framework optimized for Apple Silicon unified memory. Uses MLX-format weights (not GGUF). Often faster than llama.cpp on Apple hardware for supported models. Version approximate as of March 2026.

llama.cpp

llama.cpp backend

Backend

llama.cpp stack

Public filter: llama.cpp stack.

24 rows. Reference C++ inference engine. GGUF model format. Metal backend for Apple Silicon. Offers llama-bench for standardized benchmarking. Version is an approximate build number as of March 2026.

LM Studio

LM Studio wrapper on mixed

Wrapper

Audit only

Do not expose as a canonical runtime lock.

13 rows. Desktop application wrapping llama.cpp and MLX. GUI-based model management. Supports GGUF and MLX formats. Version approximate as of March 2026.

Ollama

Ollama wrapper on llama.cpp

Wrapper

Ollama · llama.cpp stack

Public filter: Ollama · llama.cpp stack.

1 row. Wraps llama.cpp with model management, API server, and automatic quantization selection. Most common runtime in community benchmarks. Metal GPU acceleration on macOS.

Methodology comparability

Comparability groups matter as much as raw speed.

Methodology · Comparability · Runtime version · Rows · Notes
LocalScore Standard Benchmark · localscore · not fixed · 167 rows. Standard LocalScore benchmark methodology. Uses fixed prompt templates across model sizes. Public submissions come from the official LocalScore CLI, which LocalScore documents as running on top of Llamafile.
Reddit Operator Benchmark Report · reddit_operator_benchmarks · not fixed · 16 rows. Structured practitioner benchmark reports posted to Reddit with explicit machine, runtime, quantization, and generation-speed tables. Useful as community benchmark evidence, but methodology varies and should not be treated as directly comparable to lab runs.
Awni MLX Benchmark (GitHub Gist) · mlx_community · not fixed · 10 rows. MLX benchmarks published by Awni Hannun (MLX team) as GitHub Gists. Uses mlx-lm generate with specific prompts. Methodology varies between gists; check the individual source for exact parameters.
Reddit Standardized Eval Report · reddit_standardized_evals · not fixed · 10 rows. Structured Apple Silicon evaluation reports posted to Reddit with explicit model, quantization, decode speed, RAM use, and task scores. Useful for frontier coverage but still community-reported rather than independently reproduced.
estsauver LM Studio Benchmark · lm_studio_community · not fixed · 9 rows. Community benchmarks run via LM Studio. Methodology details are partially documented; check source posts for the specific LM Studio version and settings.
Reddit MLX Context Sweep · reddit_context_sweep_benchmarks · mlx-lm (version reported in post when available) · 7 rows. Direct Apple Silicon benchmark posts with raw mlx_lm.generate output across multiple prompt lengths. These rows are stronger than generic operator anecdotes because they expose exact prompt token counts, generation speed, and peak memory.
Reddit Runtime Speed Note · reddit_runtime_notes · not fixed · 7 rows. Short Reddit field reports with an explicit Apple Silicon machine, runtime path, and approximate tokens-per-second figure. Useful for runtime and constrained-tier coverage, but lighter-weight and less comparable than structured benchmark tables.
Mac Mini M4 16GB GGUF Sweep v1 · community_batch_benchmark · llama-server (version not fixed in Silicon Score) · 6 rows. Large automated GGUF benchmark sweep on a Mac Mini M4 16GB. The source reports throughput at 1024 and 4096 context, concurrency 1 and 2, plus TTFT and quality lanes. Silicon Score publishes only a small curated subset of rows and uses the average of the two single-request throughput columns as the decode-speed summary.
llama-bench Default · llamacpp_bench · not fixed · 5 rows. Default llama-bench settings. Runs prompt-processing (pp512) and text-generation (tg128) benchmarks. Highly reproducible when the llama.cpp version is controlled. batch_size refers to the prompt-processing batch.
Reddit oMLX Context Sweep · reddit_context_sweep_benchmarks · oMLX (version not fixed in Silicon Score) · 5 rows. Structured Apple Silicon benchmark posts with raw oMLX output across multiple prompt lengths. These rows expose TTFT, prompt speed, decode speed, and peak memory, but remain community-reported and backend-specific.
SharpAI HomeSec-Bench v1 · published_agent_benchmarks · llama.cpp (version not fixed in Silicon Score) · 4 rows. Published SharpAI benchmark page comparing local Apple Silicon models and cloud models on the HomeSec-Bench task suite. Silicon Score uses the page's reported average TTFT, decode speed, and memory numbers as structured reference evidence rather than as directly lab-comparable canon.
GitHub PR Benchmark Table · mlx_community · feature branch or pull request build when stated · 2 rows. Benchmark tables shared inside GitHub pull request threads for runtime changes. Useful for before-and-after deltas on Apple Silicon, but prompt details and run controls are usually only partially disclosed.
Hugging Face Model Card Benchmark · community_model_card_benchmark · not fixed · 1 row. Single-machine benchmark figures published in Hugging Face model cards or README files for a specific quantized release. These rows are useful for Apple Silicon frontier coverage when the source gives an explicit machine, runtime, and throughput number, but settings remain source-defined rather than lab-standardized.
Silicon Score Lab Ollama v1 · silicon_score_lab · ollama 0.6.2 · 1 row. First-party controlled benchmarks from the Silicon Score lab. Ollama with a fixed version, standardized prompt, 3 warmup runs, 5 measured runs, median reported. Full methodology disclosed with each result set.
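The first-party protocol in the Silicon Score Lab row (3 warmup runs, then 5 measured runs with the median reported) can be sketched as below; `run_benchmark` is a hypothetical callable standing in for whatever actually drives the runtime:

```python
from statistics import median
from typing import Callable

def lab_decode_speed(run_benchmark: Callable[[], float],
                     warmup: int = 3, measured: int = 5) -> float:
    """Silicon Score Lab-style measurement: discard the warmup runs,
    then report the median of the measured runs (tok/s)."""
    for _ in range(warmup):
        run_benchmark()  # warm caches, model load, Metal shader compilation
    return median(run_benchmark() for _ in range(measured))
```

Median-of-5 is the part worth copying: it makes the reported figure insensitive to a single thermal or background-load outlier in a way a mean is not.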

Frontier hotspots

Where frontier interest outruns evidence.

Qwen3.5-27B


capture practitioner runtime notes

Qwen3.5-122B-A10B


capture practitioner runtime notes

Devstral Small 2 24B


capture practitioner runtime notes

Qwen3.5-397B-A17B


capture practitioner runtime notes

Qwen3.5-35B-A3B


capture practitioner runtime notes

Qwen 3 32B


capture practitioner runtime notes

Operator queue

Concrete work already queued.

Verify Qwen 3 30B A3B on the owned M4 Max 64 GB

4 reference rows already exist on this exact chip across 4 quantizations, with 52.58-92.09 tok/s of signal.

lab verification

Expand Llama 3.3 70B beyond 1 Apple Silicon tier

Llama 3.3 70B is a high-value purchase target but currently has published rows on only 1 chip tier.

coverage expansion

Expand Qwen 3 235B A22B beyond 1 Apple Silicon tier

Qwen 3 235B A22B is a high-value purchase target but currently has published rows on only 1 chip tier.

coverage expansion

Verify Qwen 3 4B on the owned M4 Max 64 GB

6 reference rows already exist on this exact chip across 6 quantizations, with 111.55-149.07 tok/s of signal.

lab verification

Verify Qwen 2.5 14B Instruct on the owned M4 Max 64 GB

1 reference row already exists on this exact chip at 1 quantization, with a 25.87 tok/s signal.

lab verification

Verify Llama 3.1 8B Instruct on the owned M4 Max 64 GB

1 reference row already exists on this exact chip at 1 quantization, with a 47.1 tok/s signal.

lab verification

Model scorecards

Teach “smart and fast,” not just “fast.”

These standardized Apple Silicon evals add task-shape truth to the catalog. They come from one fixed high-end Mac and runtime path, so they should not override the main rankings, but they are excellent for understanding which models are fast, balanced, coding-heavy, or tool-soft.

10 scorecards · single-machine comparability
Model · Machine · Quant · Avg · Tools · Code · Reason · Decode · RAM · Why it matters

Qwen3.5-122B-A10B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 8bit · avg 89% · tools 87% · code 90% · reason 90% · decode 42.7 tok/s · RAM 129.8 GB

Highest overall quality in this standardized set, but it demands real memory.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3.5-122B-A10B

vLLM-MLX

Mac Studio M3 Ultra 256GB · mxfp4 · avg 88% · tools 90% · code 90% · reason 80% · decode 57.0 tok/s · RAM 65.0 GB

The best value version in this scorecard: near-frontier quality at roughly half the RAM.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3.5-35B-A3B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 8bit · avg 85% · tools 90% · code 90% · reason 80% · decode 80.0 tok/s · RAM 36.9 GB

The stronger version of the 35B MoE story: fast and much more balanced.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3-Coder-Next

vLLM-MLX

Mac Studio M3 Ultra 256GB · 6bit · avg 82% · tools 87% · code 90% · reason 80% · decode 65.6 tok/s · RAM 64.8 GB

Slightly slower than 4-bit, but reasoning is stronger and coding stays high.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3-Coder-Next

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · avg 80% · tools 90% · code 90% · reason 70% · decode 73.5 tok/s · RAM 44.9 GB

The fast coding-first option in this scorecard, with strong tool behavior.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

GLM-4.5-Air

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · avg 78% · tools 73% · code 90% · reason 70% · decode 53.6 tok/s · RAM 60.3 GB

More balanced than the flash variant, but materially heavier.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3.5-27B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · avg 76% · tools 83% · code 90% · reason 50% · decode 37.7 tok/s · RAM 15.3 GB

A strong fits-anywhere coding and tool-use compromise.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3.5-35B-A3B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · avg 74% · tools 87% · code 90% · reason 50% · decode 95.2 tok/s · RAM 19.6 GB

Very fast for its size, but reasoning softness is visible in the standardized tasks.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3.5-9B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · avg 71% · tools 83% · code 70% · reason 60% · decode 106.4 tok/s · RAM 5.1 GB

The smallest model in this set that still looks broadly useful for agent-style work.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Devstral Small 2 24B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · avg 62% · tools 17% · code 90% · reason 70% · decode 47.2 tok/s · RAM 13.4 GB

Strong coding score, but tool calling is poor in this standardized setup.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Workflow metrics

Teach runtime tradeoffs without polluting decode-speed canon.

These records capture effective throughput and prefill-heavy scenarios from the same Mac and model across different runtime paths. They are valuable for teaching runtime choice, but they should stay in the audit lane rather than flatten into headline tokens-per-second rows.

4 workflow comparisons · 17 runtime results
Model · Machine · Scenario · Best runtime · Spread · Source · Why it matters

Qwen 3 30B-A3B

Effective tok/s · Interactive

MacBook Pro M1 Max 64GB · Ops agent (8 turns) · best runtime: LM Studio (41.7 tok/s) · spread 0.3 tok/s

These are effective throughput figures on a multi-turn ops-agent scenario. They include prefill and wrapper behavior, so they should teach runtime choice, not replace decode-speed benchmark rows.

LM Studio 41.7 · llama.cpp 41.4 · oMLX 38.0 · Ollama 26.0

Qwen3.5-35B-A3B

Effective tok/s · Interactive

MacBook Pro M1 Max 64GB · Ops agent (8 turns) · best runtime: oMLX (38.0 tok/s) · spread 2.4 tok/s

These are effective throughput results on an ops-agent workflow. They are best used to compare runtime behavior and caching quality on Apple Silicon, not to replace canonical decode-speed rows.

oMLX 38.0 · Rapid-MLX 35.6 · mlx-openai-server 26.2 · LM Studio (GGUF) 17.6 · LM Studio (MLX) 17.0

Qwen3.5-35B-A3B

Effective tok/s · 8,000 ctx

MacBook Pro M1 Max 64GB · Prefill stress (8K context) · best runtime: oMLX (16.4 tok/s) · spread 7.7 tok/s

This is an 8K prefill-stress comparison. It is useful for understanding caching and long-context behavior, not for headline decode-speed ranking.

oMLX 16.4 · mlx-openai-server 8.7 · Rapid-MLX 8.5 · LM Studio (GGUF) 7.8 · LM Studio (MLX) 5.9

Qwen 3 30B-A3B

Effective tok/s · 8,000 ctx

MacBook Pro M1 Max 64GB · Prefill stress (8K context) · best runtime: MLX fp16 (8.6 tok/s) · spread 1.0 tok/s

This compares wrappers and backends on an 8K prefill-stress scenario. It is useful for long-context teaching, but it is not a canonical decode-speed row.

MLX fp16 8.6 · GGUF 7.6 · MLX bf16 6.0
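The gap between these effective figures and headline decode speed comes from counting prefill time in the denominator. A minimal sketch of that arithmetic, with illustrative numbers rather than values from the table:

```python
def effective_tok_s(prompt_tokens: int, prefill_tok_s: float,
                    gen_tokens: int, decode_tok_s: float) -> float:
    """Generated tokens divided by total wall time (prefill + decode).
    This is the 'effective' figure the workflow rows report, as opposed
    to the decode-only speed used by canonical benchmark rows."""
    total_time = prompt_tokens / prefill_tok_s + gen_tokens / decode_tok_s
    return gen_tokens / total_time

# Illustrative: an 8K-token prompt drags effective throughput far below
# the decode-only speed, which is exactly the 8K prefill-stress pattern.
print(round(effective_tok_s(8000, 300.0, 200, 40.0), 1))  # -> 6.3
```

With a zero-length prompt the effective figure collapses back to the decode speed, which is why short interactive turns and long-context runs rank runtimes so differently.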

Research map

Separate measured evidence from discovery noise.

Not every source should influence the product equally. This map makes those roles explicit so rankings can say when an answer is measured, estimated, or still fit-first.

14 methodologies · 89 source captures · 11 tracked research sources

First-party

Where measured truth should be upgraded into canon.

Source · Priority · Best use · Trust · Freshness · Access · Why it matters

Silicon Score Lab

first party lab

Priority: highest. Best use: benchmark canon · reproduction. Trust: High. Freshness: Medium. Access: Direct fetch. Why: use as the highest-confidence anchor and as the source that upgrades important external claims from directional to measured.

Benchmark reference

Where comparable benchmark anchors and runtime-change signals usually appear first.

Source · Priority · Best use · Trust · Freshness · Access · Why it matters

Awni Hannun benchmark gists

maintainer benchmark gists

Priority: high. Best use: benchmark corroboration · mlx reference. Trust: High. Freshness: Medium. Access: Direct fetch. Why: check when MLX-related runtime or kernel changes may materially move Apple Silicon performance.

Hugging Face model hub

model registry

Priority: high. Best use: model catalog updates · release tracking. Trust: High. Freshness: High. Access: Direct fetch. Why: use to keep the model catalog current and to seed follow-up research, not to answer fit or feel alone.

oMLX repository

runtime repo

Priority: medium. Best use: runtime capability tracking · apple silicon runtime changes. Trust: Medium. Freshness: High. Access: Direct fetch. Why: monitor for Apple Silicon caching, serving, and long-context changes that could materially shift agentic coding viability on Macs.

vLLM-MLX repository

runtime repo

Priority: medium. Best use: runtime capability tracking · apple silicon runtime changes. Trust: Medium. Freshness: High. Access: Direct fetch. Why: use for runtime roadmap and capability-shift detection, then validate important claims against Silicon Score's decision lenses.

Practitioner

Where operators reveal workflow reality and caveats before the benchmark layer catches up.

Source · Priority · Best use · Trust · Freshness · Access · Why it matters

llama.cpp GitHub discussions

maintainer discussion

Priority: high. Best use: baseline reference · runtime change detection. Trust: Medium. Freshness: High. Access: Direct fetch. Why: treat as an early-warning and context source; promote specific posts into canonical evidence only after methodology review.

Reddit /r/LocalLLaMA

operator forum

Priority: high. Best use: practitioner signals · frontier watchlist. Trust: Medium. Freshness: High. Access: Partial fetch. Why: use for directional practitioner signals and frontier prioritization, not as standalone performance truth.

Discovery

Where release movement starts, but not where performance truth should harden.

Source · Priority · Best use · Trust · Freshness · Access · Why it matters

LocalScore accelerator runs

community benchmark aggregator

Priority: high. Best use: coverage discovery · gap detection. Trust: Low. Freshness: High. Access: Partial fetch. Why: use as a discovery and coverage layer, then upgrade important rows via corroboration or first-party reproduction.

mlx-lm pull requests

maintainer pull requests

Priority: high. Best use: runtime change detection · benchmark corroboration. Trust: Medium. Freshness: High. Access: Direct fetch. Why: use as an early benchmark and runtime-change source for MLX, then corroborate important results before treating them as strong product truth.

SharpAI HomeSec-Bench

published benchmark page

Priority: medium. Best use: benchmark corroboration · apple silicon laptop coverage. Trust: Medium. Freshness: High. Access: Direct fetch. Why: use as a structured external reference for M5 Pro laptop coverage and for cross-checking frontier local-agent claims before lab reproduction.

X local-AI chatter

social discovery

Priority: medium. Best use: early discovery · maintainer watch. Trust: Low. Freshness: High. Access: Manual only. Why: treat as discovery only; do not treat X posts as canonical evidence without a corroborating source that can be fetched, archived, and cited.