Bench

Audit truth, not just rows.

Bench is where Silicon Score audits evidence. It shows what is measured, what is still modeled, how runtimes are classified, which methodologies are comparable, and which evidence gaps are queued next.

Catalog state

Benchmark rows

432

Open issues

15

Reviewed issues

15

Research queue

30

Workflow comparisons

4

Model scorecards

10

Featured lab

Keep the default Mac tied to clean first-party evidence.

The featured environment is an operational pointer for Rankings and Bench. It stays on the owned M5 Max until the planned Mac Studio is physically verified and has publishable clean-recording rows.

Current default

MacBook Pro M5 Max 128GB

active

Public Rankings default and active first-party lab target: macbook-pro-m5-max-128gb-16.

Runbook

data/ops/bitter-mill/current-frontier-m5-max-2026-05.json

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/current-frontier-m5-max-2026-05.doctor.json

Planned successor

Mac Studio M4 Ultra 256GB

planned

Expected June 2026; provisional id mac-studio-m4-ultra-256gb stays out of the public catalog until verified.

Runbook

data/ops/bitter-mill/planned/m4-ultra-256gb-frontier-2026-06.json

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/m4-ultra-256gb-frontier-2026-06.doctor.json

Promotion gate

Validate before default changes

planned

In June 2026 the local lab is expected to receive a 256GB Mac Studio provisionally labeled M4 Ultra. It replaces the M5 Max 128GB MacBook Pro as the featured environment only after arrival, hardware detection, and clean first-party evidence justify changing the public default.

Next command

npm run validate:lab-environments

Capture system_profiler hardware evidence for chip name, model identifier, GPU cores, and 256GB unified memory after the Mac Studio arrives.

Keep the current M5 Max 128GB machine as the featured public default until the Mac Studio has at least one clean-recording first-party Bitter Mill or Silicon Score Lab benchmark row.

Record clean-recording hygiene with memory pressure, thermal/performance warning, local-inference process, final snapshot, and zero swap I/O evidence before any benchmark row is promoted.

Verify Apple or hardware-detection naming before creating or promoting a public data/machines.json record for the planned machine.
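The four checks above can be collapsed into one promotion-gate predicate. A minimal sketch, assuming a hypothetical candidate record shape — `arrived`, `hardware_evidence`, `clean_first_party_rows` are illustrative names, not the real Silicon Score schema:

```python
# Hypothetical sketch of the featured-default promotion gate. Field names
# are illustrative stand-ins, not the actual lab-environment schema.

def may_promote_default(candidate: dict) -> tuple[bool, list[str]]:
    """Return (ok, blockers) for switching the featured public default."""
    blockers = []
    if not candidate.get("arrived"):
        blockers.append("machine not physically present")
    # system_profiler evidence: chip name, model identifier, GPU cores, RAM
    hw = candidate.get("hardware_evidence", {})
    for field in ("chip_name", "model_identifier", "gpu_cores", "ram_gb"):
        if not hw.get(field):
            blockers.append(f"missing hardware evidence: {field}")
    if hw.get("ram_gb") and hw.get("ram_gb") != candidate.get("expected_ram_gb"):
        blockers.append("detected RAM does not match expected tier")
    # at least one clean-recording first-party benchmark row must exist
    if candidate.get("clean_first_party_rows", 0) < 1:
        blockers.append("no clean-recording first-party benchmark row")
    return (not blockers, blockers)
```

Until every blocker list is empty, the current default stays in place; the sketch deliberately reports all failures at once rather than short-circuiting, matching the audit-board style above.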

Audit now

Current evidence gaps and queued work.

This board merges unresolved quality issues, frontier candidates without enough evidence, and active operator tasks. It prioritizes the gaps most likely to change Rankings or Bench confidence.

Kind · Priority · Title · Action · Why
operator queue · 287

Run Bitter Mill current-frontier batch on the owned M5 Max

Qwen3.6, MiniMax M2.7, Gemma 4 including E2B/E4B, Mistral Small 4, Ministral 3 compact models, Llama 4 Scout, gpt-oss 120B/20B, Nemotron Cascade 2, GLM-4.5-Air, Magistral Small, Devstral Small 2, and current Qwen coding/dense anchors should be reproduced from first-hand Bitter Mill runtime, quantization, and context sweeps before they become first-party measured evidence.

22 models · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Runbook

data/ops/bitter-mill/current-frontier-m5-max-2026-05.json

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/current-frontier-m5-max-2026-05.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-current-frontier-m5-max-2026-05 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/current-frontier-m5-max-2026-05/system-hygiene.json

Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The all-runbooks readiness audit checks every Bitter Mill runbook with one captured machine and clean-recording system-hygiene snapshot before any runbook is opened.

first party bitter mill batch · Silicon Score's frontier relevance depends on first-hand local-inference data, but these runs are only useful if the machine is dedicated, the recording is clean, and the batch answers fit, fastest runtime, and highest-quality clean quantization for the featured M5 Max default. Use the runbook to keep a sampling hygiene monitor active through the Bitter Mill window, then import only rows that discovery and importer dry-run can resolve completely.
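The clean-recording criteria in this gate can be sketched as a single readiness check. The snapshot field names (`swap_io_pages`, `thermal_state`, and so on) are assumptions for illustration, not the actual hygiene-doctor output schema:

```python
# Minimal sketch of the clean-recording hygiene gate. Snapshot field
# names are illustrative assumptions; the real doctor output may differ.

def clean_recording_ready(snapshot: dict) -> tuple[bool, list[str]]:
    """Gate a Bitter Mill window on a hygiene-doctor-style snapshot."""
    problems = []
    if snapshot.get("swap_io_pages", 1) != 0:
        problems.append("nonzero swap I/O pages")
    if snapshot.get("thermal_state", "unknown") != "nominal":
        problems.append("thermal/performance warning")
    if snapshot.get("memory_pressure", "unknown") != "normal":
        problems.append("memory pressure")
    if snapshot.get("local_inference_processes", []):
        problems.append("stale local inference process")
    if not snapshot.get("memory_metrics_present", False):
        problems.append("memory metrics missing")
    # Pre-existing swap *allocation* is warning-only, not a blocker,
    # as long as swap I/O during the window stays at zero.
    return (not problems, problems)
```

Note the asymmetry the board describes: old swap allocation alone does not block, but any swap I/O pages, throttling, or memory pressure during the window does.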
operator queue · 271

Run planned M4 Ultra 256GB frontier arrival batch

When the 256GB Mac Studio arrives, run a clean-recording Bitter Mill sweep across the active M5 current-frontier target set (including MiniMax M2.7 plus high-memory and fit-boundary extras), runtimes, quantizations (including current MLX dynamic low-bit profiles), and context ladders before changing Silicon Score's featured environment.

26 models · 1 source

Status: Blocked until arrival · M4 Ultra 256 GB stays queued until the machine is physically present, hardware identity is verified, and clean-recording preflight passes.

Runbook

data/ops/bitter-mill/planned/m4-ultra-256gb-frontier-2026-06.json

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/m4-ultra-256gb-frontier-2026-06.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-m4-ultra-256gb-frontier-2026-06 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/m4-ultra-256gb-frontier-2026-06/system-hygiene.json

Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened on the arrived Mac Studio. / The active all-runbooks readiness audit passes for checked-in active runbooks, then the planned M4 Ultra runbook's explicit check passes on the arrived Mac Studio.

first party bitter mill batch · The new Mac Studio should become the most interesting Silicon Score environment only after first-party rows prove which current frontier models fit, which compact and coding-agent lanes still matter, which low-bit and quality-retaining quantizations stay useful, and which runtime path is fastest on the exact machine.
operator queue · 239

Upgrade Qwen3.5-122B-A10B reference rows with Bitter Mill runs

Qwen3.5-122B-A10B is a high-memory frontier MoE candidate with trusted-reference rows and field speed reports, and this queued Bitter Mill import will test whether it belongs in high-end Mac advice.

1 model · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/import-bitter-mill-qwen-3-5-122b-a10b.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label import-bitter-mill-qwen-3-5-122b-a10b --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/qwen-3-5-122b-a10b.system-hygiene.json

Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The hygiene monitor starts before Bitter Mill is opened and stops only after the model export lands; a same-base monitor hygiene sidecar exists beside the Bitter Mill export.

first party bitter mill import · Large-MoE Apple Silicon advice is especially sensitive to memory pressure, active-parameter behavior, and runtime setup. First-party Bitter Mill traces should decide whether this remains a high-end Mac recommendation or a reference-only frontier row.
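The monitor-before-open, stop-after-export sidecar contract described in this gate can be sketched as a sampling loop. The file layout, field names, and clean criterion here are illustrative assumptions, not the real bench:hygiene:session format:

```python
import json
import time
from pathlib import Path

# Illustrative sketch of a hygiene monitor sidecar: sample a metrics
# callable at a fixed interval and persist a same-base sidecar JSON next
# to the eventual Bitter Mill export. Field names are assumptions.

def run_monitor(sample_fn, out_path: Path, samples: int,
                interval_s: float = 15.0) -> dict:
    """Collect `samples` snapshots and write them as a sidecar file."""
    collected = []
    for i in range(samples):
        snap = sample_fn()
        snap["t"] = i * interval_s  # offset from window start, seconds
        collected.append(snap)
        if i + 1 < samples:
            time.sleep(interval_s)
    sidecar = {
        "label": out_path.stem,
        "sample_interval_ms": int(interval_s * 1000),
        "samples": collected,
        # the window is clean only if no sample ever saw swap I/O
        "clean": all(s.get("swap_io_pages", 1) == 0 for s in collected),
    }
    out_path.write_text(json.dumps(sidecar, indent=2))
    return sidecar
```

The design point is that the sidecar covers the whole window: a single contaminated sample marks the entire recording unclean, which is why the monitor must start before the runtime is opened and stop only after the export lands.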
operator queue · 238

Upgrade Qwen3.5-397B-A17B reference rows with Bitter Mill runs

Qwen3.5-397B-A17B is the current high-memory Qwen3.5 flagship with sparse Apple Silicon reference evidence, and this queued Bitter Mill import keeps it behind clean first-party reproduction.

1 model · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/import-bitter-mill-qwen-3-5-397b-a17b.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label import-bitter-mill-qwen-3-5-397b-a17b --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/qwen-3-5-397b-a17b.system-hygiene.json

Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The hygiene monitor starts before Bitter Mill is opened and stops only after the model export lands; a same-base monitor hygiene sidecar exists beside the Bitter Mill export.

first party bitter mill import · This model can influence high-end Mac Studio and M5 Max advice, but it is especially sensitive to memory pressure and active-parameter behavior. Silicon Score should keep it behind clean-recording Bitter Mill reproduction until first-party traces prove whether it is a recommendation or a reference-only frontier row.
operator queue · 225

Run Bitter Mill current-reference batch on the owned M5 Max

Qwen 3 32B, Qwen 3 235B-A22B, Devstral Small 1.1, Qwen 3 8B, Nemotron-3-Nano, GLM-4.7-Flash, Phi-4 14B, Mistral Small 3.1, DeepSeek R1 distills, and Qwen 2.5 72B have Apple Silicon evidence or practitioner signals but should be treated as reference/baseline reproduction work rather than the current frontier lane. Qwen3.5-35B-A3B stays in the current-frontier M5 batch as the practical Qwen3.6 displacement comparator.

12 models · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Runbook

data/ops/bitter-mill/current-reference-m5-max-2026-05.json

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/current-reference-m5-max-2026-05.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-current-reference-m5-max-2026-05 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/current-reference-m5-max-2026-05/system-hygiene.json

Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The all-runbooks readiness audit checks every Bitter Mill runbook with one captured machine and clean-recording system-hygiene snapshot before any runbook is opened.

first party bitter mill batch · These rows are not the top Qwen3.6/Gemma 4 freshness lane, but they still shape baseline replacement decisions and historical speed curves. A single clean-recording Bitter Mill batch keeps them from becoming invisible stale-reference debt without crowding Bench with one task per model.
operator queue · 214

Run M5 Max 128 GB frontier anchors under clean-recording lab hygiene

Use the owned M5 Max 128 GB plan to convert frontier and coding-agent watchlist signals into first-party Silicon Score lab rows, but only when the recording window proves zero swap I/O and clean system pressure.

8 models · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:lab:m5:anchors:dry-run -- --preflight --clean-recording

Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · 0 swap I/O pages

Gate: Actual run must occur on a detected M5 Max 128 GB machine; runner mismatch warnings are blockers for canonical factory_measured promotion. / Output benchmark records resolve chip, RAM, model, runtime, quantization, source date, and generation tok/s before append review.

lab verification · The current host class is the fastest path to first-party high-memory Apple Silicon evidence. The public rankings should not lean on community M5 Max rows for Qwen, Devstral, Gemma, and Llama anchors when a clean-recording local lab plan can reproduce them under disclosed methodology.
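The append-review completeness gate above — chip, RAM, model, runtime, quantization, source date, and generation tok/s must all resolve — can be sketched as a field check; the record field names are illustrative, not the canonical benchmark schema:

```python
# Sketch of the append-review completeness gate for a benchmark record.
# Field names are illustrative; the real record schema may differ.

REQUIRED = ("chip", "ram_gb", "model", "runtime", "quantization",
            "source_date", "gen_tok_s")

def missing_fields(record: dict) -> list[str]:
    """Return the required fields a candidate row fails to resolve."""
    missing = [f for f in REQUIRED if record.get(f) in (None, "", [])]
    # a numeric zero throughput is unresolved, not a measurement
    if record.get("gen_tok_s") == 0:
        missing.append("gen_tok_s")
    return missing
```

A row enters append review only when this list is empty; anything else stays a directional signal.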
operator queue · 202

Monitor premier Apple Silicon benchmark sites for freshness gaps

Track LLMCheck, oMLX, asiai, Anubis OSS, apple-silicon-llm-bench, mac-llm-bench, and broad LocalScore discovery so Silicon Score notices frontier releases, benchmark hygiene patterns, and UX/evidence gaps before public rankings go stale.

7 sources

Next command

npm run research:compact -- read --url https://llmcheck.net/benchmarks --query "Qwen3.6 Gemma 4 Apple Silicon benchmark methodology raw measurements" --maxChars 2400 --refresh

Hygiene gate: Clean recording · pre-existing swap warning-only

Gate: Every candidate source fact is backed by a fetched URL or captured artifact. / Freshness-review reads use research:compact --refresh so currentness checks do not silently reuse stale cache entries.

competitor freshness review · There is product alpha in disciplined R&D hygiene: know what serious reference sites cover, verify the claims with better provenance, then make Silicon Score more current, comprehensive, informative, and usable.
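The --refresh semantics this gate relies on can be sketched as a cache-bypassing read helper; the cache shape, TTL, and fetch callable are illustrative assumptions, not the real research:compact internals:

```python
import time

# Sketch of --refresh semantics: a cached read helper that freshness
# reviews force past its cache. Cache shape and TTL are assumptions.

_cache: dict[str, tuple[float, str]] = {}

def read_url(url: str, fetch, ttl_s: float = 86400.0,
             refresh: bool = False) -> str:
    """Return cached content unless expired or refresh is forced."""
    now = time.monotonic()
    hit = _cache.get(url)
    if hit and not refresh and now - hit[0] < ttl_s:
        return hit[1]  # silent cache reuse: fine for casual reads only
    content = fetch(url)  # currentness checks must take this path
    _cache[url] = (now, content)
    return content
```

The point of the gate is the `refresh=True` path: without it, a currentness check can "pass" against a cached page and conclude nothing changed when the source actually moved.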
operator queue · 189

Corroborate GLM-5.1 Apple Silicon viability

Official GLM-5.1 metadata and local-serving docs are captured, plus row-level oMLX and Hugging Face MLX quantization evidence on M3 Ultra 512GB; these are directional field signals until first-party reproduction captures clean-recording hygiene.

1 model · 7 signals · 5 sources

Next command

npm run research:compact -- read --url https://huggingface.co/zai-org/GLM-5.1 --query "local deployment quantized GGUF MLX KTransformers GLM-5.1 Apple Silicon" --maxChars 3000 --refresh

Hygiene gate: Clean recording · pre-existing swap warning-only

Gate: The official GLM-5.1 model metadata and release note are captured before public ranking copy treats it as current. / Any Apple Silicon row must explicitly identify GLM-5.1, hardware, runtime, quantization, context, and source date before it becomes a practitioner signal or benchmark candidate.

coverage expansion · GLM-5.1 is now Z.AI's current flagship long-horizon agent model, but a 754B-class model only matters for Mac recommendations if extreme-memory Apple Silicon evidence is explicit. The captured M3 Ultra thread is useful viability signal, but it lacks exact runtime, context, quantization build, and methodology, so it should drive monitoring and first-party reproduction rather than ranking confidence.
operator queue · 186

Corroborate Mistral Small 4 Apple Silicon runtime

Mistral Small 4 has official local-serving support, Hugging Face MLX conversion cards, LLMCheck trusted-reference Apple Silicon rows, and SharpAI HomeSec-Bench M5 Pro 64GB llama.cpp field reports. The HomeSec rows make it a concrete reproduction candidate, but they remain directional community/operator evidence until first-party Bitter Mill captures setup, quantization, context, methodology, and hygiene sidecars.

1 model · 3 signals · 6 sources

Next command

npm run research:compact -- read --url https://www.sharpai.org/benchmark/ --query "Mistral-Small-4-119B Q2_K_XL UD-IQ1_M MacBook Pro M5 Pro 64GB llama.cpp tok/s TTFT HomeSec-Bench" --maxChars 5000 --refresh

Hygiene gate: Clean recording · pre-existing swap warning-only

Gate: Any Apple Silicon row must explicitly identify Mistral Small 4, hardware, runtime, quantization, context or prompt shape, reported speed, and source date before it becomes a practitioner signal or benchmark candidate. / The SharpAI HomeSec-Bench M5 Pro 64GB rows are curated directional community_operator_report signals; use them to design first-party reproduction, not as canonical benchmark rows.

coverage expansion · Mistral Small 4 can change high-end Mac recommendations because it is a 119B MoE with only 6.5B active parameters and a 256K context window, but Silicon Score should not let generic local-serving support, competitor rows, or a domain benchmark alone substitute for setup-aware first-party Apple Silicon evidence.
operator queue · 184

Corroborate Magistral Small Apple Silicon runtime

Magistral Small has official 24B reasoning-model grounding, a fit note that the quantized model can run within a 32GB RAM MacBook, and Hugging Face MLX conversion cards, but the 2026-05-05 compact refresh found no row-level Apple Silicon throughput evidence in LLMCheck, oMLX, mac-llm-bench, broad search, LocalLLaMA search, or the MLX cards. The existing M5 Max Bitter Mill batch owns first-party measurement; this task keeps external/runtime corroboration explicit without promoting fit guidance into speed evidence.

1 model · 1 signal · 5 sources

Next command

npm run research:compact -- read --url https://huggingface.co/mistralai/Magistral-Small-2506 --query "Magistral Small 2506 local inference quantized MacBook 32GB context degradation GGUF MLX Apple Silicon speed" --maxChars 4200 --refresh

Hygiene gate: Clean recording · pre-existing swap warning-only

Gate: Any Apple Silicon row must explicitly identify Magistral Small 2506, hardware, runtime, quantization, context or prompt shape, reported speed, and source date before it becomes a practitioner signal or benchmark candidate. / The official Hugging Face card and Mistral release post ground currentness, self-deployment, and runtime planning only; they do not establish Apple Silicon throughput.

coverage expansion · Magistral Small is a current open reasoning lane that could matter for local Mac users because it is smaller than the high-end MoE frontier models and has official runtime paths through llama.cpp, LM Studio, Ollama, and MLX-adjacent packaging. Its 40k recommended context ceiling and lack of Apple Silicon speed rows mean it should not move ranking confidence until setup-aware throughput is measured or externally corroborated.
operator queue · 160

Corroborate Llama 3.3 70B cross-tier coverage

Llama 3.3 70B now has Apple Silicon rows across M1, M2, M3, M4, and M5-era tiers, including trusted-reference LLMCheck anchors on M5 Max and M4 Ultra.

1 model · 4 sources

Next command

npm run research:compact -- read --url https://llmcheck.net/data/benchmarks.json --query "Llama 3.3 70B Apple Silicon chip ram engine tps date" --maxChars 2400 --refresh

Gate: Any candidate coverage row must explicitly identify Llama 3.3 70B, Apple Silicon hardware, RAM tier or clearly bounded chip tier, runtime, quantization, reported speed, and source date. / New community or competitor rows stay directional unless methodology and provenance are strong enough for trusted-reference treatment.

coverage expansion · The buying curve now mixes trusted-reference rows with older community-unverified reports; the highest-value next step is first-party Bitter Mill or lab corroboration before those community rows shape public recommendations.
frontier hotspot · 116

GLM-5

GLM-5 is not yet published in the current frontier packs. Evidence on hand: 5 Apple Silicon benchmark rows, 1 captured official model brief, 7 fetched artifacts (1 blocked, missing, or partial), and 4 curated practitioner signals, 3 of them Apple Silicon-specific. Field summary: 6 Apple Silicon field reports; best reported generation ~20 tok/s; best reported prompt processing ~187 tok/s; reported RAM use ~391.82-415.41GB; seen on M3 Ultra and Mac Studio M3 Ultra 512GB; via oMLX.

1 model · 2 signals · 1 source

Next command

npm run research:compact -- read --url https://z.ai/blog/glm-5 --query "GLM-5 launch local deployment vLLM SGLang Apple Silicon evidence" --maxChars 4000 --refresh

Gate: Keep the original practitioner claim low-confidence while the source capture_status=http_error. / The official Z.AI launch blog and BigModel docs may support GLM-5 currentness and local-serving planning, but they do not resolve the unavailable practitioner source and must not become Apple Silicon throughput evidence.

practitioner source recovery · GLM-5 has a frontier-scale Apple Silicon operator claim whose original linked artifact still returns capture_status=http_error. Exact-title web recovery on 2026-05-05 found no direct repost or archive, so that original qualitative claim should remain low-confidence while the captured oMLX report carries the grounded setup, speed, batching, and prefill evidence.
quality issue · 90

Reviewed: m4-max-40-core-gpu--qwen-3-32b--q4-k-m--lab

Keep the M4 Max lab value as first-party setup-specific evidence while treating Qwen 3 32B as a reference-only baseline displaced by newer Qwen3.5 and Qwen3.6 records.

reviewed outlier kept · Keep gen_tok_s = 22 as Silicon Score Lab evidence; Qwen 3 32B remains reference-only against newer Qwen3.5/Qwen3.6 lanes.
quality issue · 89

Reviewed: m3-max-gpu-count-not-published--qwen-3-32b--q8-0--reddit-m3-max-64gb-api-sweep-20172-llamacpp

Keep the low long-context llama.cpp value as directional community evidence; it is useful for context-scaling caution but should not drive short-prompt recommendations.

reviewed outlier kept · Keep gen_tok_s = 7.58 as long-context community evidence; do not compare it directly with short-prompt Q4_K_M LLMCheck rows.
quality issue · 88

Reviewed: m3-max-gpu-count-not-published--qwen-3-32b--q8-0--reddit-m3-max-64gb-api-sweep-20172-ollama

Keep the low long-context Ollama value as directional community evidence; it is useful for context-scaling caution but should not drive short-prompt recommendations.

reviewed outlier kept · Keep gen_tok_s = 7.53 as long-context community evidence; do not compare it directly with short-prompt Q4_K_M LLMCheck rows.
quality issue · 87

Reviewed: m1-pro-16-core-gpu--llama-2-7b--q4-0--llamacpp

Source review did not recover a RAM tier for the M1 Pro 16-core GPU llama.cpp baseline row, so the benchmark remains useful but exact-machine confidence must stay limited.

source limit documented · Keep machine_ram_gb null; no inferred RAM tier may be written from chip/GPU class alone.
quality issue · 86

Reviewed: m3-pro-18-core-gpu--llama-2-7b--q4-0--llamacpp

Source review did not recover a RAM tier for the M3 Pro 18-core GPU llama.cpp baseline row, so the benchmark remains useful but exact-machine confidence must stay limited.

source limit documented · Keep machine_ram_gb null; no inferred RAM tier may be written from chip/GPU class alone.

Competitor currentness

Keep external evidence current before Rankings drift.

These sources are tracked for frontier coverage, workflow UX, and benchmark hygiene. They inform research priorities; they do not bypass Silicon Score provenance or first-party reproduction.

Monitored surfaces

7

2 highest-priority Apple Silicon coverage sources

Weekly refresh loop

3

rerun with refreshed source reads before frontier conclusions move

Raw-data candidates

3

download or repository rows can enter import review after model grounding

Publication gates

7

competitor claims remain directional until the source-specific gate clears

Source · Role · Cadence · Current signal · Silicon Score response · Gate and gap · Links
LLMCheck

highest

Rendered table · Raw download · Methodology · Community verified
weekly · Fetched May 4, 2026 with research:compact --refresh: the April 2026 benchmark page advertises 180+ benchmarks across 50 models, 12 Apple Silicon chips, and 3 engines, but the refreshed reader-visible table currently shows only the header followed by no visible rows. The open-data page and downloadable JSON remain the import-review surface with 122 measurements. The JSON reports version 2026-04-17 and a methodology of Q4_K_M quantization, 256-token input, 512-token output, 3 averaged runs, and a freshly booted system, while row-level quantization fields also include Q8_0. The raw schema covers model, params, quant, chip, RAM, engine, generation tok/s, TTFT, and date. The refreshed Silicon Score dry-run now maps all 122 source rows after adding official-source-grounded Ministral 3 8B and Ministral 3 14B current compact records, has zero missing mapped rows after import, has no unsupported quantizations, and reports zero skipped source labels. An expanded targeted frontier dry-run maps 32 requested Qwen3.6, Gemma 4, Mistral Small 4, Llama 4 Scout, gpt-oss 120B, and Nemotron Cascade 2 rows while explicitly reporting no current LLMCheck source rows for Devstral Small 2, GLM-4.5-Air, GLM-5.1, Magistral Small, or Qwen3.6-27B.

Use as a high-signal reference import target, then upgrade Qwen3.6, Gemma 4, Mistral Small 4, Ministral 3 compact models, Llama 4 Scout, gpt-oss 120B, and Nemotron Cascade 2 rows with Bitter Mill or Silicon Score lab traces before treating them as first-party measured truth. Source-verified downloadable JSON rows now have a repeatable research:llmcheck:import dry-run path before append, including skipped-label reporting for any future LLMCheck-only names and row-level Q8_0 support.

Gemma 4 E2B/E4B, Mistral Small 4, Ministral 3 8B/14B, Llama 4 Scout, gpt-oss 120B, Nemotron Cascade 2, and Qwen3/Qwen3.5 reference baselines including Qwen3.5-4B, Qwen3.5-9B, Qwen 3 8B, Qwen 3 14B, Llama 3.1 8B, Mistral 7B v0.3, Phi-4 14B, Phi-4 Mini, Gemma 3 4B, Gemma 3 12B, DeepSeek R1 Distill Llama 8B, and DeepSeek R1 Distill Qwen 32B now have mapped trusted-reference coverage where official model grounding exists, while first-party reproduction remains queued for recommendation-changing conclusions.

Gate: Downloadable JSON rows can become trusted-reference evidence only after official model verification; first-party Bitter Mill or lab traces are still required before ranking-changing conclusions.

The rendered marketing count, reader-visible table, raw-data count, and requested-model coverage can diverge, so treat the benchmark page as a frontier radar and the 122-row CSV/JSON dataset as the import-review surface. Community submissions are verified against baselines rather than Silicon Score clean-recording sidecars; monitor both rendered and raw surfaces with --refresh before concluding coverage changed. The current full dry-run has no unmapped source labels after normalizing LLMCheck's generic Ministral 8B/14B labels to official Ministral 3 records; targeted dry-runs should still inspect requested_models_without_source_rows so high-priority absences such as Devstral Small 2, GLM-4.5-Air, GLM-5.1, Magistral Small, and Qwen3.6-27B become explicit source-monitoring gaps rather than silent non-matches. Any newly appearing LLMCheck-only model name, quantization, or methodology variant needs official-source and schema verification before canonical model or benchmark promotion.
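The skipped-label and requested_models_without_source_rows reporting described for this dry-run can be sketched as follows; the alias table and row shape are assumptions, not the real research:llmcheck:import implementation:

```python
# Illustrative import dry-run report: map source rows to known canonical
# models and surface both skipped source labels and requested models
# with no source rows. Alias table and row shape are assumptions.

def dry_run(rows: list[dict], aliases: dict[str, str],
            requested: list[str]) -> dict:
    mapped, skipped = [], []
    for row in rows:
        canonical = aliases.get(row.get("model", ""))
        if canonical is None:
            skipped.append(row.get("model", ""))  # needs official grounding
        else:
            mapped.append({**row, "canonical_model": canonical})
    seen = {m["canonical_model"] for m in mapped}
    return {
        "mapped": mapped,
        "skipped_source_labels": sorted(set(skipped)),
        "requested_models_without_source_rows":
            [m for m in requested if m not in seen],
    }
```

The design choice worth copying is that absences are reported explicitly: a requested model with no source rows becomes a named monitoring gap instead of a silent non-match.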
oMLX

Community table · Runtime context · PP/TG split · Frontier signal

weekly · Fetched May 4, 2026 with research:compact --refresh: the community table moved to 169,913 submitted rows and the date-sorted first page continued same-day churn. The latest visible examples include Gemma 4 31B 8-bit rows on M3 Ultra 512 GB and Qwen3.6-35B-A3B-oQ5e 5-bit long-context rows on M4 Max 128 GB. Row-level refresh also found MiniMax-M2.7-ConfigI on M5 Max 128 GB at 4-bit with a 1k through 32k context sweep, TTFT, peak memory, oMLX version, and macOS fields. Earlier May 4 reads also included Qwen3.6 derivative or fine-tuned rows on M5 Max 128 GB, M4 Max 128 GB/36 GB, M4 Pro 48 GB, M3 Max 64 GB, M3 Ultra 512 GB, and M2 Max 96 GB, plus Gemma 4 26B-A4B-it on M2 Max 32 GB. Targeted Mistral Small 4 refreshes did not surface a row-level oMLX result for that exact model.

Treat as a fast frontier radar for agentic MLX/oMLX workflows and queue first-party reproduction for any row that would affect rankings. MiniMax M2.7 is now tracked as a current Apple Silicon candidate because official sources and M5 Max oMLX field evidence intersect, but its community ConfigI row stays directional until Bitter Mill lineage, runtime, quantization, memory, and hygiene sidecars are captured. Derivative, fine-tuned, or otherwise variant model labels must resolve lineage before they can support a base canonical model; unresolved rows stay directional setup signals.

Gate: Directional only until rows are deduplicated, derivative/fine-tune lineage is resolved, context-normalized, hygiene-reviewed, and reproduced first-party when they would move rankings.

High-volume community submissions shift minute by minute, so exact totals are only timestamped source observations. Rows need deduplication, derivative or fine-tune lineage review, setup hygiene checks, context-length normalization, and Bitter Mill reproduction before any row can influence Silicon Score scores.
Anubis OSS

Native telemetry · Exports · Leaderboard · Stress reference

biweekly · Fetched May 4, 2026 with research:compact --refresh: native macOS benchmark app for Apple Silicon and OpenAI-compatible endpoints with over 160 community submitted runs. It exports Markdown/CSV benchmark reports and raw history, supports leaderboard upload and filtering, and the refreshed GitHub read shows Anubis OSS v3.1 as the latest release on May 3, 2026 with reasoning-aware metrics and prefill speed. Telemetry includes tokens/sec, TTFT, process/model memory, GPU/CPU utilization, thermal state, GPU frequency, and GPU/CPU/ANE/DRAM power charts. The current README also documents backend identification for Ollama, LM Studio, mlx-lm, vLLM, LocalAI, and llama.cpp, plus process-memory accounting through phys_footprint so Metal/GPU allocations can be included.

Use as a UX and instrumentation reference for first-party lab hygiene, especially export shape, telemetry breadth, and community upload flows. Capture promising leaderboard rows only as discovery signals until raw exports and setup context are reviewable.

Gate: Treat leaderboard rows as discovery until comparable raw exports, setup context, and Silicon Score hygiene sidecars are captured.

The public leaderboard is client-rendered and some metrics vary by backend exposure, so source stability and comparable exports need review before canonical use. Its telemetry breadth, reasoning-aware metric direction, prefill-speed presentation, backend detection, and phys_footprint memory accounting are useful targets for Silicon Score tooling, but competitor/community rows stay directional without Silicon Score hygiene sidecars.
asiai

high

Methodology · Agent API · Benchmark tool · Hygiene reference

weekly · Fetched May 4, 2026 with research:compact --refresh: benchmark best-practices v0.3.2 documents TTFT separation, deterministic sampling, warmup runs, median primary metrics, per-engine power monitoring, engine/model/hardware metadata, sequential engine execution, thermal-speed-limit recording, and context/KV-cache warnings. The April 25, 2026 agent guide exposes MCP/REST status surfaces for swap_used_gb, memory_pressure, thermal_state, GPU utilization, GPU renderer/tiler activity, GPU memory allocation, engine detection, model-loaded state, and benchmark history.

Use as an R&D hygiene and UX benchmark for local inference observability, especially memory isolation, engine detection, thermal state, context allocation, and agent-readable status surfaces. Keep Silicon Score's clean-recording sidecar gates for first-party Bitter Mill imports, with strict-zero-swap only where a runbook deliberately opts into that profile.

Gate: Use for Silicon Score tooling and run hygiene; benchmark rows need canonical methodology mapping and Silicon Score clean-recording sidecars before publication.

Shared leaderboard rows still need canonical methodology mapping. The fetched best-practices page warns against simultaneous engine benchmarks and highlights long-context KV-cache allocation as a memory-saturation risk. The agent guide treats nonzero swap as warning/danger bands; Silicon Score should keep numeric swap I/O, thermal state, context allocation, memory-pressure thresholds, and final clean-recording snapshots visible before publication rather than treating old swap allocation alone as contamination.
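The warmup-then-median convention documented in those best practices can be sketched in a few lines; the run-list shape is an assumption:

```python
import statistics

# Sketch of the median-primary-metric convention: discard warmup runs,
# then report the median of the measured runs. Run shape is assumed.

def primary_tok_s(run_tok_s: list[float], warmup: int = 1) -> float:
    """Median generation tok/s over measured runs, excluding warmups."""
    measured = run_tok_s[warmup:]
    if not measured:
        raise ValueError("need at least one measured run after warmup")
    return statistics.median(measured)
```

Median rather than mean keeps one thermally throttled or cache-cold run from dragging the headline number, which is the same reasoning the best-practices page gives for sequential engine execution.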
Repository data · Runtime comparison · Context matrix · Memory measurement
monthly · Fetched May 4, 2026 with research:compact --refresh: systematic M3 Max 64 GB dataset reports 791 measurements, 57 configurations, 8 backends, 7 current models, 6 quantization formats, and 7 KV-cache strategies. It measures cold/warm TTFT across realistic context depths, decode throughput, peak RSS memory, server lifecycle, warmups, 3 measured runs, and teardown. Use as a structured runtime-comparison reference for oMLX, vllm-mlx, MLX, llama.cpp, Ollama, LM Studio, and Docker Model Runner behavior on one Apple Silicon tier. Methodology and runtime-shape reference only until any row is corroborated or reproduced on the target Silicon Score machine tier. Single-machine M3 Max coverage is not a buying curve, so it should inform runtime methodology and candidate selection rather than broad Mac recommendations.
Broad leaderboard · Direct result · Historical Mac rows
biweekly · Fetched May 4, 2026 with research:compact --refresh: the homepage renders a broad local-AI benchmark leaderboard whose first visible rows are NVIDIA/GPU-heavy rather than Apple-Silicon-specific. Individual accelerator and result pages remain useful when directly linked. Keep as broad hardware discovery and historical Apple Silicon coverage, but prefer Silicon Score/Bitter Mill, LLMCheck, oMLX, and Apple-specific tools for frontier ranking movement. Broad discovery only unless a direct Apple Silicon result URL has enough methodology and provenance for review. It is not Apple-only, the current homepage first viewport is not Mac-centered, and the old accelerator index route is unstable. Use individual result or accelerator pages for provenance rather than copying aggregate rankings.
Repository data · apple silicon coverage discovery · Runtime comparison · quality benchmark caution · community tooling reference
monthly · Fetched May 4, 2026 with research:compact --refresh: the repo describes an Apple-Silicon-only community benchmark database with M5 coverage at 62 benchmarks across 37 GGUF and 25 MLX runs, plus M2 Max coverage. It supports GGUF llama-bench, MLX mlx_lm.benchmark, HumanEval+ quality checks, raw result JSON, result schemas, hardware detection, and generated per-generation result tables. A same-source LocalLLaMA post reports a MacBook Air M5 32GB Q4_K_M llama-bench sweep across 37 models, including Gemma 4 E2B at 29.2 tg128 tok/s and Gemma 4 26B-A4B at 16.2 tg128 tok/s. Track as a useful Mac-specific competitor/tooling source and candidate-discovery feed. Use its methodology, schemas, and base-M5 coverage to improve Silicon Score's reproduction queue and UX, but keep rows directional unless exact raw JSON, runtime, quantization, machine tier, library versions, and source date are captured. No direct ranking movement from aggregate tables. Any candidate row needs raw result review plus corroboration or strict-zero-swap Silicon Score/Bitter Mill reproduction before trusted-reference or first-party promotion. Hardware coverage is still sparse, default GGUF and MLX quantization paths differ, HumanEval+ quality is a narrow coding benchmark, and the repo flags Gemma 4 quality results as potentially affected by a llama.cpp tool-calling or premature-stop issue.

Evidence pressure

Coverage is still the main weakness.

Metadata debt is now small. The bigger issue is still how much of the catalog depends on community rows and thin runtime coverage.

Macs

29

Models

55

Source captures

110

Evidence class · Rows · Share
community unverified · 280 · 65%
trusted reference · 151 · 35%
factory measured · 1 · 0%
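The split above sanity-checks against the catalog totals: the three evidence classes sum to the 432 benchmark rows, and the shares are rounded row fractions.

```python
# Row counts from the evidence-class table; shares are rounded percentages of 432.
rows = {"community unverified": 280, "trusted reference": 151, "factory measured": 1}
total = sum(rows.values())
assert total == 432  # matches the catalog's benchmark-row count
shares = {cls: round(100 * n / total) for cls, n in rows.items()}
assert shares == {"community unverified": 65, "trusted reference": 35, "factory measured": 0}
```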

Unresolved issues

Remaining gaps still need evidence.

Reviewed: m4-max-40-core-gpu--qwen-3-32b--q4-k-m--lab

Keep the M4 Max lab value as first-party setup-specific evidence while treating Qwen 3 32B as a reference-only baseline displaced by newer Qwen3.5 and Qwen3.6 records.

Reviewed value kept

reviewed outlier kept

Reviewed: m3-max-gpu-count-not-published--qwen-3-32b--q8-0--reddit-m3-max-64gb-api-sweep-20172-llamacpp

Keep the low long-context llama.cpp value as directional community evidence; it is useful for context-scaling caution but should not drive short-prompt recommendations.

Reviewed value kept

reviewed outlier kept

Reviewed: m3-max-gpu-count-not-published--qwen-3-32b--q8-0--reddit-m3-max-64gb-api-sweep-20172-ollama

Keep the low long-context Ollama value as directional community evidence; it is useful for context-scaling caution but should not drive short-prompt recommendations.

Reviewed value kept

reviewed outlier kept

Reviewed: m1-pro-16-core-gpu--llama-2-7b--q4-0--llamacpp

Source review did not recover a RAM tier for the M1 Pro 16-core GPU llama.cpp baseline row, so the benchmark remains useful but exact-machine confidence must stay limited.

Reviewed source limit

source limit documented

Reviewed: m3-pro-18-core-gpu--llama-2-7b--q4-0--llamacpp

Source review did not recover a RAM tier for the M3 Pro 18-core GPU llama.cpp baseline row, so the benchmark remains useful but exact-machine confidence must stay limited.

Reviewed source limit

source limit documented

Reviewed: m4-max--qwen-3-32b--q4-k-m--estsauver-lm

Keep the LM Studio 10K-context value as trusted-reference evidence while preserving setup sensitivity in ranking explanations.

Reviewed value kept

reviewed outlier kept

Runtime taxonomy

Why runtime is a facet, not a separate page.

Public runtime locks should stay simple. Backends and wrappers still matter, but they belong here as audit semantics rather than top-level navigation.

Runtime · Layer · Public lock · Rows · Why it exists

Llamafile

Llamafile wrapper on llama.cpp

Wrapper

llama.cpp stack

Public filter: llama.cpp stack.

167 rows · Mozilla's portable llama.cpp packaging and CLI. LocalScore runs on top of this stack for submitted benchmark results. Latest version verified from official GitHub releases on 2026-03-25.

MLX

MLX backend

Backend

MLX

Public filter: MLX.

96 rows · Apple's ML framework optimized for Apple Silicon unified memory. Uses MLX-format weights (not GGUF). Often faster than llama.cpp on Apple hardware for supported models. Version approximate as of March 2026.

Ollama

Ollama wrapper on llama.cpp

Wrapper

Ollama · llama.cpp stack

Public filter: Ollama · llama.cpp stack.

84 rows · Wraps llama.cpp with model management, API server, and automatic quantization selection. Most common runtime in community benchmarks. Metal GPU acceleration on macOS.

llama.cpp

llama.cpp backend

Backend

llama.cpp stack

Public filter: llama.cpp stack.

49 rows · Reference C++ inference engine. GGUF model format. Metal backend for Apple Silicon. Offers llama-bench for standardized benchmarking. Version is approximate build number as of March 2026.

LM Studio

LM Studio wrapper on mixed

Wrapper

Audit only

Do not expose as a canonical runtime lock.

33 rows · Desktop application wrapping llama.cpp and MLX. GUI-based model management. Supports GGUF and MLX formats. Version approximate as of March 2026.
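The taxonomy above amounts to a small lookup from runtime to layer and public lock, with wrappers collapsing into their backend stack and audit-only runtimes resolving to no lock at all. A minimal sketch; the keys and field names are invented for illustration, not Silicon Score's actual schema:

```python
# Illustrative facet table mirroring the runtime taxonomy above.
RUNTIMES = {
    "llamafile": {"layer": "wrapper", "lock": "llama.cpp stack"},
    "mlx":       {"layer": "backend", "lock": "MLX"},
    "ollama":    {"layer": "wrapper", "lock": "Ollama · llama.cpp stack"},
    "llama.cpp": {"layer": "backend", "lock": "llama.cpp stack"},
    "lm-studio": {"layer": "wrapper", "lock": None},  # audit only, never a public lock
}

def public_lock(runtime):
    """Resolve a row's runtime to its public filter, or None for audit-only runtimes."""
    entry = RUNTIMES.get(runtime)
    return entry["lock"] if entry else None
```

Keeping the lock a derived field is what lets runtime stay a facet: the public UI filters on the small lock set while audit pages can still see the full wrapper/backend detail.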

Methodology comparability

Comparability groups matter as much as raw speed.

Methodology · Comparability · Runtime version · Rows · Notes
LocalScore Standard Benchmark · localscore · Not fixed · 167 rows · Standard LocalScore benchmark methodology. Uses fixed prompt templates across model sizes. Public submissions come from the official LocalScore CLI, which LocalScore documents as running on top of Llamafile.
LLMCheck Standard Q4_K_M Apple Silicon · llmcheck_standard_q4km · Not fixed · 113 rows · LLMCheck open Apple Silicon benchmark rows. The source documents Q4_K_M quantization unless otherwise noted, a standardized 256-token input prompt, 512 generated tokens, three-run averages, and generation tok/s excluding prompt processing.
Reddit Runtime Speed Note · reddit_runtime_notes · Not fixed · 24 rows · Short Reddit field reports with an explicit Apple Silicon machine, runtime path, and approximate tokens-per-second figure. Useful for runtime and constrained-tier coverage, but lighter-weight and less comparable than structured benchmark tables.
Reddit OpenAI API Context Sweep · reddit_context_sweep_benchmarks · Runtime version reported in post when available · 20 rows · Structured Reddit benchmark posts that drive local runtimes through an OpenAI-compatible API and publish exact prompt-token counts, TTFT, prompt-processing speed, and decode speed across increasing contexts. These rows remain community-reported, but they are materially stronger than anecdotal notes because they expose cross-runtime behavior under the same scripted request path.
Reddit Operator Benchmark Report · reddit_operator_benchmarks · Not fixed · 20 rows · Structured practitioner benchmark reports posted to Reddit with explicit machine, runtime, quantization, and generation-speed tables. Useful as community benchmark evidence, but methodology varies and should not be treated as directly comparable to lab runs.
Reddit Standardized Eval Report · reddit_standardized_evals · Not fixed · 12 rows · Structured Apple Silicon evaluation reports posted to Reddit with explicit model, quantization, decode speed, RAM use, and task scores. Useful for frontier coverage but still community-reported rather than independently reproduced.
Awni MLX Benchmark (GitHub Gist) · mlx_community · Not fixed · 10 rows · MLX benchmarks published by Awni Hannun (MLX team) as GitHub Gists. Uses mlx-lm generate with specific prompts. Methodology varies between gists; check individual source for exact parameters.
estsauver LM Studio Benchmark · lm_studio_community · Not fixed · 9 rows · Community benchmarks run via LM Studio. Methodology details partially documented. Check source posts for specific LM Studio version and settings.
LLMCheck Standard Q8_0 Apple Silicon · llmcheck_standard_q8 · Not fixed · 9 rows · LLMCheck open Apple Silicon benchmark rows where the raw source row reports Q8_0 quantization. The source documents a standardized 256-token input prompt, 512 generated tokens, three-run averages, and generation tok/s excluding prompt processing, while the row-level quantization field overrides the page-level Q4_K_M default.
Mac Mini M4 16GB GGUF Sweep v1 · community_batch_benchmark · llama-server (version not fixed in Silicon Score) · 8 rows · Large automated GGUF benchmark sweep on a Mac Mini M4 16GB. The source reports throughput at 1024 and 4096 context, concurrency 1 and 2, plus TTFT and quality lanes. Silicon Score publishes only a small curated subset of rows and uses the average of the two single-request throughput columns as the decode-speed summary.
Reddit LM Studio Context Sweep · reddit_context_sweep_benchmarks · LM Studio (version not fixed in Silicon Score) · 8 rows · Structured Apple Silicon benchmark posts with raw LM Studio-reported prompt-processing and generation output across increasing context sizes. These rows remain community-reported, but they are materially stronger than generic anecdotes because they expose exact context lengths and speed degradation under longer coding-style prompts.
Reddit MLX Context Sweep · reddit_context_sweep_benchmarks · mlx-lm (version reported in post when available) · 7 rows · Direct Apple Silicon benchmark posts with raw mlx_lm.generate output across multiple prompt lengths. These rows are stronger than generic operator anecdotes because they expose exact prompt token counts, generation speed, and peak memory.
llama-bench Default · llamacpp_bench · Not fixed · 6 rows · Default llama-bench settings. Runs prompt processing (pp512) and text generation (tg128) benchmarks. Highly reproducible when llama.cpp version is controlled. batch_size refers to prompt processing batch.
Reddit oMLX Context Sweep · reddit_context_sweep_benchmarks · oMLX (version not fixed in Silicon Score) · 5 rows · Structured Apple Silicon benchmark posts with raw oMLX output across multiple prompt lengths. These rows expose TTFT, prompt speed, decode speed, and peak memory, but remain community-reported and backend-specific.
SharpAI HomeSec-Bench v1 · published_agent_benchmarks · llama.cpp (version not fixed in Silicon Score) · 4 rows · Published SharpAI benchmark page comparing local Apple Silicon models and cloud models on the HomeSec-Bench task suite. Silicon Score uses the page's reported average TTFT, decode speed, and memory numbers as structured reference evidence rather than as directly lab-comparable canon.
Reddit llama-bench Context Sweep · reddit_context_sweep_benchmarks · llama.cpp (version not fixed in Silicon Score) · 3 rows · Structured Apple Silicon benchmark posts with raw llama-bench prompt-processing and generation throughput across multiple KV depths. These rows are community-reported, but materially stronger than casual speed notes because they publish the exact benchmark command, prompt length, generation length, and per-depth throughput table.
GitHub PR Benchmark Table · mlx_community · feature branch or pull request build when stated · 2 rows · Benchmark tables shared inside GitHub pull request threads for runtime changes. Useful for before-and-after deltas on Apple Silicon, but prompt details and run controls are usually only partially disclosed.
Published Runtime Benchmark Article v1 · published_runtime_benchmarks · runtime versions reported in the article when available · 2 rows · Independent benchmark articles that disclose an explicit Apple Silicon machine, model, runtime stack, quantization, and reported speed counters from a reproducible local benchmark setup. Silicon Score uses these as structured community reference evidence rather than as directly lab-comparable canon.
Hugging Face Model Card Benchmark · community_model_card_benchmark · Not fixed · 1 row · Single-machine benchmark figures published in Hugging Face model cards or README files for a specific quantized release. These rows are useful for Apple Silicon frontier coverage when the source gives an explicit machine, runtime, and throughput number, but settings remain source-defined rather than lab-standardized.
Hugging Face Quantization Card Ollama Benchmark · hf_quantization_card_benchmarks · Not fixed · 1 row · Benchmark tables published on Hugging Face quantization cards with explicit Apple Silicon hardware, quantization, runtime, and tok/s fields. Treat as community-reference evidence unless independently reproduced.
Silicon Score Lab Ollama v1 · silicon_score_lab · ollama 0.6.21 · First-party controlled benchmarks from the Silicon Score lab. Ollama with fixed version, standardized prompt, 3 warmup runs, 5 measured runs, median reported. The lab runner records macOS memory/swap hygiene at preflight and rejects runs when swap I/O moves during a benchmark request. Full methodology disclosed with each result set.
Silicon Score Bitter Mill Import v1 · silicon_score_lab · Not fixed · 0 rows · First-party Bitter Mill and local inference engine traces normalized into canonical benchmark rows. Imported rows must identify the machine, runtime, model, quantization, source date, and measured generation throughput before publication. Operators should run the clean-recording benchmark hygiene preflight before Bitter Mill captures; imports require auditable available memory, memory-pressure percentage, final snapshot evidence, no thermal/performance warning, and zero swap-I/O activity during capture. Historical swap allocation within the clean-recording threshold is advisory rather than publication-blocking.
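Comparability keys exist so rows are only ranked against rows measured the same way. A minimal sketch of that grouping, with row fields assumed for illustration:

```python
from collections import defaultdict

def comparable_groups(rows):
    """Bucket benchmark rows by comparability key; speeds are only
    ranked within a bucket, never across methodologies."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["comparability"]].append(row)
    return groups

rows = [
    {"comparability": "localscore", "tok_s": 21.4},
    {"comparability": "llmcheck_standard_q4km", "tok_s": 33.0},
    {"comparability": "localscore", "tok_s": 18.9},
]
groups = comparable_groups(rows)
# The 33.0 tok/s row never competes with the LocalScore rows:
# different prompt templates make the raw numbers incomparable.
best_localscore = max(groups["localscore"], key=lambda r: r["tok_s"])
```

This is why the table tracks a comparability column at all: a fast number under one methodology says nothing about its rank under another.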

Frontier hotspots

Where frontier interest outruns evidence.

Run Bitter Mill current-frontier batch on the owned M5 Max

22 frontier watchlist rows share this operator task: Qwen3.6-27B, Gemma 4 31B, Qwen3.5-27B, Qwen3.6-35B-A3B, Devstral Small 2 24B, Mistral Small 4 119B, and 16 more. Qwen3.6, MiniMax M2.7, Gemma 4 including E2B/E4B, Mistral Small 4, Ministral 3 compact models, Llama 4 Scout, gpt-oss 120B/20B, Nemotron Cascade 2, GLM-4.5-Air, Magistral Small, Devstral Small 2, and current Qwen coding/dense anchors should be reproduced from first-hand Bitter Mill runtime, quantization, and context sweeps before they become first-party measured evidence.

22 models · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/current-frontier-m5-max-2026-05.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-current-frontier-m5-max-2026-05 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/current-frontier-m5-max-2026-05/system-hygiene.json

Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The all-runbooks readiness audit checks every Bitter Mill runbook with one captured machine and clean-recording system-hygiene snapshot before any runbook is opened.

first party bitter mill batch
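The "0 swap I/O pages" clause in the gate above can be approximated on macOS by diffing `vm_stat`'s swap counters across the recording window. This is a sketch of the idea, not the actual hygiene doctor: pre-existing swap allocation stays advisory, while swap movement during the window blocks.

```python
import re

def swap_io_pages(vm_stat_output):
    """Sum swap-in and swap-out page counts from macOS `vm_stat` output."""
    total = 0
    for line in vm_stat_output.splitlines():
        m = re.match(r"(Swapins|Swapouts):\s+(\d+)\.?", line.strip())
        if m:
            total += int(m.group(2))
    return total

def clean_recording(before, after):
    """Gate passes only when no swap I/O happened during the window.
    A nonzero baseline is fine (warning-only); any *delta* blocks."""
    return swap_io_pages(after) - swap_io_pages(before) == 0
```

Diffing counters rather than checking for zero absolute values is what lets "pre-existing swap warning-only" coexist with a hard "0 swap I/O pages" rule.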

Qwen3.5-397B-A17B

Qwen3.5-397B-A17B. It appears across 4 lenses and 1 budget slice. Benchmark evidence includes 2 Apple Silicon benchmark rows. 1 official model brief captured. 8 fetched artifacts. 5 curated practitioner signals, 5 Apple Silicon-specific. Qwen3.5-397B-A17B: 4 Apple Silicon field reports; best reported generation ~30.81 tok/s; best reported prompt processing ~122.46 tok/s; seen on MacBook Pro M5 MAX 128GB; via llama.cpp, flash-moe, MLX.

1 model · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/import-bitter-mill-qwen-3-5-397b-a17b.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label import-bitter-mill-qwen-3-5-397b-a17b --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/qwen-3-5-397b-a17b.system-hygiene.json

Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The hygiene monitor starts before Bitter Mill is opened and stops only after the model export lands; a same-base monitor hygiene sidecar exists beside the Bitter Mill export.

first party bitter mill import

Qwen3.5-122B-A10B

Qwen3.5-122B-A10B. It appears across 4 lenses and 3 budget slices. Benchmark evidence includes 6 Apple Silicon benchmark rows. 1 official model brief captured. 15 fetched artifacts, 1 blocked, missing, or partial. 16 curated practitioner signals, 16 Apple Silicon-specific. Qwen3.5-122B-A10B: 24 Apple Silicon field reports; best reported generation ~65.853 tok/s; best reported prompt processing ~1239.734 tok/s; reported RAM use ~71.91-102GB; seen on MacBook Pro M5 MAX 128GB, Mac Studio M3 ULTRA 256GB, Mac Studio M4 MAX 128GB; via MLX, oMLX, EXO over Thunderbolt 5 RDMA, llama.cpp.

1 model · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/import-bitter-mill-qwen-3-5-122b-a10b.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label import-bitter-mill-qwen-3-5-122b-a10b --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/qwen-3-5-122b-a10b.system-hygiene.json

Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The hygiene monitor starts before Bitter Mill is opened and stops only after the model export lands; a same-base monitor hygiene sidecar exists beside the Bitter Mill export.

first party bitter mill import

GLM-5.1

GLM-5.1. It is not yet published in the current frontier packs. No canonical Apple Silicon benchmark rows yet; field speed reports are still directional. 3 official model briefs captured. 13 fetched artifacts. 7 curated practitioner signals, 6 Apple Silicon-specific. GLM-5.1: 7 Apple Silicon field reports; best reported generation ~19.527 tok/s; best reported prompt processing ~194.216 tok/s; reported RAM use ~251-382.6GB; seen on Mac Studio M3 ULTRA 512GB, Mac Studio M3 ULTRA 256GB; via MLX, oMLX.

1 model · 7 signals · 5 sources

Next command

npm run research:compact -- read --url https://huggingface.co/zai-org/GLM-5.1 --query "local deployment quantized GGUF MLX KTransformers GLM-5.1 Apple Silicon" --maxChars 3000 --refresh

Hygiene gate: Clean recording · pre-existing swap warning-only

Gate: The official GLM-5.1 model metadata and release note are captured before public ranking copy treats it as current. / Any Apple Silicon row must explicitly identify GLM-5.1, hardware, runtime, quantization, context, and source date before it becomes a practitioner signal or benchmark candidate.

coverage expansion

Run M5 Max 128 GB frontier anchors under clean-recording lab hygiene

4 frontier watchlist rows share this operator task: Qwen3-Coder-30B-A3B, Llama 3.3 70B, Qwen 3 30B-A3B, Gemma 3 27B. Use the owned M5 Max 128 GB plan to convert frontier and coding-agent watchlist signals into first-party Silicon Score lab rows, but only when the recording window proves zero swap I/O and clean system pressure.

8 models · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:lab:m5:anchors:dry-run -- --preflight --clean-recording

Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · 0 swap I/O pages

Gate: Actual run must occur on a detected M5 Max 128 GB machine; runner mismatch warnings are blockers for canonical factory_measured promotion. / Output benchmark records resolve chip, RAM, model, runtime, quantization, source date, and generation tok/s before append review.

lab verification
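The second gate clause above (records must resolve chip, RAM, model, runtime, quantization, source date, and generation tok/s before append review) is a required-field check. A sketch with assumed field names:

```python
# Field names are illustrative, not Silicon Score's actual record schema.
REQUIRED = ("chip", "ram_gb", "model", "runtime", "quantization",
            "source_date", "generation_tok_s")

def missing_fields(record):
    """List the required fields a benchmark record fails to resolve."""
    return [f for f in REQUIRED if record.get(f) in (None, "")]

record = {"chip": "M5 Max", "ram_gb": 128, "model": "Qwen 3 30B-A3B",
          "runtime": "llama.cpp", "quantization": "Q4_K_M",
          "source_date": "2026-05-04", "generation_tok_s": 58.2}
assert missing_fields(record) == []  # fully resolved: eligible for append review
assert "model" in missing_fields({"chip": "M5 Max"})  # incomplete: blocked
```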

GLM-5

GLM-5. It is not yet published in the current frontier packs. Benchmark evidence includes 5 Apple Silicon benchmark rows. 1 official model brief captured. 7 fetched artifacts, 1 blocked, missing, or partial. 4 curated practitioner signals, 3 Apple Silicon-specific. GLM-5: 6 Apple Silicon field reports; best reported generation ~20 tok/s; best reported prompt processing ~187 tok/s; reported RAM use ~391.82-415.41GB; seen on M3 ULTRA, Mac Studio M3 ULTRA 512GB; via oMLX.

1 model · 2 signals · 1 source

Next command

npm run research:compact -- read --url https://z.ai/blog/glm-5 --query "GLM-5 launch local deployment vLLM SGLang Apple Silicon evidence" --maxChars 4000 --refresh

Gate: Keep the original practitioner claim low-confidence while the source capture_status=http_error. / The official Z.AI launch blog and BigModel docs may support GLM-5 currentness and local-serving planning, but they do not resolve the unavailable practitioner source and must not become Apple Silicon throughput evidence.

practitioner source recovery

Run Bitter Mill current-reference batch on the owned M5 Max

11 frontier watchlist rows share this operator task: Devstral Small 1.1, Mistral Small 3.1 24B, Qwen 3 235B-A22B, DeepSeek R1 Distill Qwen 32B, GLM-4.7-Flash, DeepSeek R1 Distill Llama 70B, and 5 more. Qwen 3 32B, Qwen 3 235B-A22B, Devstral Small 1.1, Qwen 3 8B, Nemotron-3-Nano, GLM-4.7-Flash, Phi-4 14B, Mistral Small 3.1, DeepSeek R1 distills, and Qwen 2.5 72B have Apple Silicon evidence or practitioner signals but should be treated as reference/baseline reproduction work rather than the current frontier lane. Qwen3.5-35B-A3B stays in the current-frontier M5 batch as the practical Qwen3.6 displacement comparator.

12 models · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/current-reference-m5-max-2026-05.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-current-reference-m5-max-2026-05 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/current-reference-m5-max-2026-05/system-hygiene.json

Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The all-runbooks readiness audit checks every Bitter Mill runbook with one captured machine and clean-recording system-hygiene snapshot before any runbook is opened.

first party bitter mill batch

Operator queue

Concrete work already queued.

Run Bitter Mill current-frontier batch on the owned M5 Max

Qwen3.6, MiniMax M2.7, Gemma 4 including E2B/E4B, Mistral Small 4, Ministral 3 compact models, Llama 4 Scout, gpt-oss 120B/20B, Nemotron Cascade 2, GLM-4.5-Air, Magistral Small, Devstral Small 2, and current Qwen coding/dense anchors should be reproduced from first-hand Bitter Mill runtime, quantization, and context sweeps before they become first-party measured evidence.

22 models · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Runbook

data/ops/bitter-mill/current-frontier-m5-max-2026-05.json

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/current-frontier-m5-max-2026-05.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-current-frontier-m5-max-2026-05 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/current-frontier-m5-max-2026-05/system-hygiene.json

Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The all-runbooks readiness audit checks every Bitter Mill runbook with one captured machine and clean-recording system-hygiene snapshot before any runbook is opened.

first party bitter mill batch

Run planned M4 Ultra 256GB frontier arrival batch

When the 256GB Mac Studio arrives, run a clean-recording Bitter Mill sweep across the active M5 current-frontier target set including MiniMax M2.7 plus high-memory and fit-boundary extras, runtimes, quantizations including current MLX dynamic low-bit profiles, and context ladders before changing Silicon Score's featured environment.

26 models · 1 source

Status: Blocked until arrival · M4 Ultra 256 GB stays queued until the machine is physically present, hardware identity is verified, and clean-recording preflight passes.

Runbook

data/ops/bitter-mill/planned/m4-ultra-256gb-frontier-2026-06.json

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/m4-ultra-256gb-frontier-2026-06.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-m4-ultra-256gb-frontier-2026-06 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/m4-ultra-256gb-frontier-2026-06/system-hygiene.json

Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened on the arrived Mac Studio. / The active all-runbooks readiness audit passes for checked-in active runbooks, then the planned M4 Ultra runbook's explicit check passes on the arrived Mac Studio.

first party bitter mill batch

Upgrade Qwen3.5-122B-A10B reference rows with Bitter Mill runs

Qwen3.5-122B-A10B is a high-memory frontier MoE candidate with trusted-reference rows and field speed reports, and this queued Bitter Mill import will test whether it belongs in high-end Mac advice.

1 model · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/import-bitter-mill-qwen-3-5-122b-a10b.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label import-bitter-mill-qwen-3-5-122b-a10b --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/qwen-3-5-122b-a10b.system-hygiene.json

Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The hygiene monitor starts before Bitter Mill is opened and stops only after the model export lands; a same-base monitor hygiene sidecar exists beside the Bitter Mill export.

first party bitter mill import

Upgrade Qwen3.5-397B-A17B reference rows with Bitter Mill runs

Qwen3.5-397B-A17B is the current high-memory Qwen3.5 flagship with sparse Apple Silicon reference evidence, and this queued Bitter Mill import keeps it behind clean first-party reproduction.

1 model · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/import-bitter-mill-qwen-3-5-397b-a17b.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label import-bitter-mill-qwen-3-5-397b-a17b --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/qwen-3-5-397b-a17b.system-hygiene.json

Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The hygiene monitor starts before Bitter Mill is opened and stops only after the model export lands; a same-base monitor hygiene sidecar exists beside the Bitter Mill export.

first party bitter mill import

Run Bitter Mill current-reference batch on the owned M5 Max

Qwen 3 32B, Qwen 3 235B-A22B, Devstral Small 1.1, Qwen 3 8B, Nemotron-3-Nano, GLM-4.7-Flash, Phi-4 14B, Mistral Small 3.1, DeepSeek R1 distills, and Qwen 2.5 72B have Apple Silicon evidence or practitioner signals but should be treated as reference/baseline reproduction work rather than the current frontier lane. Qwen3.5-35B-A3B stays in the current-frontier M5 batch as the practical Qwen3.6 displacement comparator.

12 models · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Runbook

data/ops/bitter-mill/current-reference-m5-max-2026-05.json

Next command

npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/current-reference-m5-max-2026-05.doctor.json

Monitor command

npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-current-reference-m5-max-2026-05 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/current-reference-m5-max-2026-05/system-hygiene.json

Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling

Gate: The hygiene doctor must report clean-recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The all-runbooks readiness audit checks every Bitter Mill runbook for one captured machine and a clean-recording system-hygiene snapshot before any runbook is opened.

first party bitter mill batch

Run M5 Max 128 GB frontier anchors under clean-recording lab hygiene

Use the owned M5 Max 128 GB plan to convert frontier and coding-agent watchlist signals into first-party Silicon Score lab rows, but only when the recording window proves zero swap I/O and clean system pressure.

8 models · 1 source

Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.

Next command

npm run bench:lab:m5:anchors:dry-run -- --preflight --clean-recording

Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · 0 swap I/O pages

Gate: Actual run must occur on a detected M5 Max 128 GB machine; runner mismatch warnings are blockers for canonical factory_measured promotion. / Output benchmark records resolve chip, RAM, model, runtime, quantization, source date, and generation tok/s before append review.
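The second gate clause reduces to a completeness check over each candidate record before append review. A sketch under assumed field names (the snake_case keys are illustrative, not the actual record schema):

```python
# Fields the gate text requires every output benchmark record to resolve.
# Key names are hypothetical; the real record schema may differ.
REQUIRED_FIELDS = (
    "chip", "ram", "model", "runtime",
    "quantization", "source_date", "generation_tok_s",
)

def missing_fields(record: dict) -> list[str]:
    """List required fields that are absent or empty; empty list = review-ready."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
```

A record only enters append review when `missing_fields` returns an empty list.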

lab verification

Monitor premier Apple Silicon benchmark sites for freshness gaps

Track LLMCheck, oMLX, asiai, Anubis OSS, apple-silicon-llm-bench, mac-llm-bench, and broad LocalScore discovery so Silicon Score notices frontier releases, benchmark hygiene patterns, and UX/evidence gaps before public rankings go stale.

7 sources

Next command

npm run research:compact -- read --url https://llmcheck.net/benchmarks --query "Qwen3.6 Gemma 4 Apple Silicon benchmark methodology raw measurements" --maxChars 2400 --refresh

Hygiene gate: Clean recording · pre-existing swap warning-only

Gate: Every candidate source fact is backed by a fetched URL or captured artifact. / Freshness-review reads use research:compact --refresh so currentness checks do not silently reuse stale cache entries.

competitor freshness review

Corroborate GLM-5.1 Apple Silicon viability

Official GLM-5.1 metadata and local-serving docs are captured, plus row-level oMLX and Hugging Face MLX quantization evidence on M3 Ultra 512GB; these are directional field signals until first-party reproduction captures clean-recording hygiene.

1 model · 7 signals · 5 sources

Next command

npm run research:compact -- read --url https://huggingface.co/zai-org/GLM-5.1 --query "local deployment quantized GGUF MLX KTransformers GLM-5.1 Apple Silicon" --maxChars 3000 --refresh

Hygiene gate: Clean recording · pre-existing swap warning-only

Gate: The official GLM-5.1 model metadata and release note are captured before public ranking copy treats it as current. / Any Apple Silicon row must explicitly identify GLM-5.1, hardware, runtime, quantization, context, and source date before it becomes a practitioner signal or benchmark candidate.

coverage expansion

Corroborate Mistral Small 4 Apple Silicon runtime

Mistral Small 4 has official local-serving support, Hugging Face MLX conversion cards, LLMCheck trusted-reference Apple Silicon rows, and SharpAI HomeSec-Bench M5 Pro 64GB llama.cpp field reports. The HomeSec rows make it a concrete reproduction candidate, but they remain directional community/operator evidence until first-party Bitter Mill captures setup, quantization, context, methodology, and hygiene sidecars.

1 model · 3 signals · 6 sources

Next command

npm run research:compact -- read --url https://www.sharpai.org/benchmark/ --query "Mistral-Small-4-119B Q2_K_XL UD-IQ1_M MacBook Pro M5 Pro 64GB llama.cpp tok/s TTFT HomeSec-Bench" --maxChars 5000 --refresh

Hygiene gate: Clean recording · pre-existing swap warning-only

Gate: Any Apple Silicon row must explicitly identify Mistral Small 4, hardware, runtime, quantization, context or prompt shape, reported speed, and source date before it becomes a practitioner signal or benchmark candidate. / The SharpAI HomeSec-Bench M5 Pro 64GB rows are curated directional community_operator_report signals; use them to design first-party reproduction, not as canonical benchmark rows.

coverage expansion

Corroborate Magistral Small Apple Silicon runtime

Magistral Small has official 24B reasoning-model grounding, a fit note that the quantized model can run within a 32GB RAM MacBook, and Hugging Face MLX conversion cards, but the 2026-05-05 compact refresh found no row-level Apple Silicon throughput evidence in LLMCheck, oMLX, mac-llm-bench, broad search, LocalLLaMA search, or the MLX cards. The existing M5 Max Bitter Mill batch owns first-party measurement; this task keeps external/runtime corroboration explicit without promoting fit guidance into speed evidence.

1 model · 1 signal · 5 sources

Next command

npm run research:compact -- read --url https://huggingface.co/mistralai/Magistral-Small-2506 --query "Magistral Small 2506 local inference quantized MacBook 32GB context degradation GGUF MLX Apple Silicon speed" --maxChars 4200 --refresh

Hygiene gate: Clean recording · pre-existing swap warning-only

Gate: Any Apple Silicon row must explicitly identify Magistral Small 2506, hardware, runtime, quantization, context or prompt shape, reported speed, and source date before it becomes a practitioner signal or benchmark candidate. / The official Hugging Face card and Mistral release post ground currentness, self-deployment, and runtime planning only; they do not establish Apple Silicon throughput.

coverage expansion

Corroborate Llama 3.3 70B cross-tier coverage

Llama 3.3 70B now has Apple Silicon rows across M1, M2, M3, M4, and M5-era tiers, including trusted-reference LLMCheck anchors on M5 Max and M4 Ultra.

1 model · 4 sources

Next command

npm run research:compact -- read --url https://llmcheck.net/data/benchmarks.json --query "Llama 3.3 70B Apple Silicon chip ram engine tps date" --maxChars 2400 --refresh

Gate: Any candidate coverage row must explicitly identify Llama 3.3 70B, Apple Silicon hardware, RAM tier or clearly bounded chip tier, runtime, quantization, reported speed, and source date. / New community or competitor rows stay directional unless methodology and provenance are strong enough for trusted-reference treatment.

coverage expansion

Model scorecards

Smart and fast — not just fast.

These standardized Apple Silicon evals add task-shape truth to the catalog. They come from one fixed high-end Mac and runtime path, so they should not override the main rankings, but they are excellent for understanding which models are fast, balanced, coding-heavy, or tool-soft.

10 scorecards · Single-machine comparability
Model · Machine · Quant · Avg · Tools · Code · Reason · Decode · RAM · Why it matters

Qwen3.5-122B-A10B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 8bit · Avg 89% · Tools 87% · Code 90% · Reason 90% · Decode 42.7 tok/s · RAM 129.8 GB

Highest overall quality in this standardized set, but it demands real memory.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3.5-122B-A10B

vLLM-MLX

Mac Studio M3 Ultra 256GB · mxfp4 · Avg 88% · Tools 90% · Code 90% · Reason 80% · Decode 57.0 tok/s · RAM 65.0 GB

The best value version in this scorecard: near-frontier quality at roughly half the RAM.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3.5-35B-A3B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 8bit · Avg 85% · Tools 90% · Code 90% · Reason 80% · Decode 80.0 tok/s · RAM 36.9 GB

The stronger version of the 35B MoE story: fast and much more balanced.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3-Coder-Next

vLLM-MLX

Mac Studio M3 Ultra 256GB · 6bit · Avg 82% · Tools 87% · Code 90% · Reason 80% · Decode 65.6 tok/s · RAM 64.8 GB

Slightly slower than 4-bit, but reasoning is stronger and coding stays high.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3-Coder-Next

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · Avg 80% · Tools 90% · Code 90% · Reason 70% · Decode 73.5 tok/s · RAM 44.9 GB

The fast coding-first option in this scorecard, with strong tool behavior.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

GLM-4.5-Air

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · Avg 78% · Tools 73% · Code 90% · Reason 70% · Decode 53.6 tok/s · RAM 60.3 GB

More balanced than the flash variant, but materially heavier.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3.5-27B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · Avg 76% · Tools 83% · Code 90% · Reason 50% · Decode 37.7 tok/s · RAM 15.3 GB

A strong fits-anywhere coding and tool-use compromise.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3.5-35B-A3B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · Avg 74% · Tools 87% · Code 90% · Reason 50% · Decode 95.2 tok/s · RAM 19.6 GB

Very fast for its size, but reasoning softness is visible in the standardized tasks.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Qwen3.5-9B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · Avg 71% · Tools 83% · Code 70% · Reason 60% · Decode 106.4 tok/s · RAM 5.1 GB

The smallest model in this set that still looks broadly useful for agent-style work.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Devstral Small 2 24B

vLLM-MLX

Mac Studio M3 Ultra 256GB · 4bit · Avg 62% · Tools 17% · Code 90% · Reason 70% · Decode 47.2 tok/s · RAM 13.4 GB

Strong coding score, but tool calling is poor in this standardized setup.

vLLM-MLX SCORECARD.md  ·  discussion · 2026-03-04

Workflow metrics

Keeping headline tok/s honest when runtime varies.

These records capture effective throughput and prefill-heavy scenarios from the same Mac and model across different runtime paths. They are valuable for teaching runtime choice, but they should stay in the audit lane rather than flatten into headline tokens-per-second rows.

4 workflow comparisons · 17 runtime results
Model · Machine · Scenario · Best runtime · Spread · Source · Why it matters

Qwen 3 30B-A3B

Effective tok/s · Interactive

MacBook Pro M1 Max 64GB · Ops agent (8 turns) · Best: LM Studio (41.7) · Spread: 0.3 tok/s

These are effective throughput figures on a multi-turn ops-agent scenario. They include prefill and wrapper behavior, so they should teach runtime choice, not replace decode-speed benchmark rows.

LM Studio 41.7 · llama.cpp 41.4 · oMLX 38.0 · Ollama 26.0
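One way to read these numbers: effective throughput charges prefill and wrapper time against the same generated tokens, so it always sits at or below pure decode speed. A sketch of that reading, with illustrative numbers (the 8-turn totals below are made up, not measurements from this comparison):

```python
def decode_tok_s(generated_tokens: int, decode_seconds: float) -> float:
    """Pure decode speed: tokens divided by decode time only."""
    return generated_tokens / decode_seconds

def effective_tok_s(generated_tokens: int, prefill_seconds: float,
                    decode_seconds: float, wrapper_seconds: float = 0.0) -> float:
    """Effective speed: the same tokens divided by the whole scenario wall time."""
    return generated_tokens / (prefill_seconds + decode_seconds + wrapper_seconds)

# A hypothetical 8-turn agent run: 2,400 generated tokens,
# 12 s of prefill, 48 s of decode, 6 s of wrapper overhead.
# decode_tok_s(2400, 48)          -> 50.0 tok/s
# effective_tok_s(2400, 12, 48, 6) -> ~36.4 tok/s
```

This is why these records should teach runtime choice rather than replace decode-speed benchmark rows: the gap between the two figures grows with prefill share and wrapper overhead, both of which vary by runtime.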

Qwen3.5-35B-A3B

Effective tok/s · Interactive

MacBook Pro M1 Max 64GB · Ops agent (8 turns) · Best: oMLX (38.0) · Spread: 2.4 tok/s

These are effective throughput results on an ops-agent workflow. They are best used to compare runtime behavior and caching quality on Apple Silicon, not to replace canonical decode-speed rows.

oMLX 38.0 · Rapid-MLX 35.6 · mlx-openai-server 26.2 · LM Studio (GGUF) 17.6 · LM Studio (MLX) 17.0

Qwen3.5-35B-A3B

Effective tok/s · 8,000 ctx

MacBook Pro M1 Max 64GB · Prefill stress (8K context) · Best: oMLX (16.4) · Spread: 7.7 tok/s

This is an 8K prefill-stress comparison. It is useful for understanding caching and long-context behavior, not for headline decode-speed ranking.

oMLX 16.4 · mlx-openai-server 8.7 · Rapid-MLX 8.5 · LM Studio (GGUF) 7.8 · LM Studio (MLX) 5.9

Qwen 3 30B-A3B

Effective tok/s · 8,000 ctx

MacBook Pro M1 Max 64GB · Prefill stress (8K context) · Best: MLX fp16 (8.6) · Spread: 1.0 tok/s

This compares wrappers and backends on an 8K prefill-stress scenario. It is useful for long-context teaching, but it is not a canonical decode-speed row.

MLX fp16 8.6 · GGUF 7.6 · MLX bf16 6.0
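Across all four comparisons, the Spread value is consistent with the gap between the two fastest runtimes (for example, LM Studio 41.7 minus llama.cpp 41.4 gives 0.3 tok/s). A sketch of that interpretation:

```python
def spread(results: dict[str, float]) -> float:
    """Gap between the fastest and second-fastest runtime, in tok/s."""
    ordered = sorted(results.values(), reverse=True)
    return round(ordered[0] - ordered[1], 1)

# The Qwen 3 30B-A3B ops-agent comparison from above:
ops_agent = {"LM Studio": 41.7, "llama.cpp": 41.4, "oMLX": 38.0, "Ollama": 26.0}
# spread(ops_agent) -> 0.3
```

Read this way, a small spread (0.3 tok/s on the ops-agent run) means runtime choice barely matters for that scenario, while a large one (7.7 tok/s under 8K prefill stress) means it matters a lot.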

Research map

Separate measured evidence from discovery noise.

Not every source should influence the product equally. This map makes those roles explicit so rankings can say when an answer is measured, estimated, or still fit-first.

22 methodologies · 110 source captures · 20 tracked research sources

First-party

Where measured truth should be upgraded into canon.

Source · Priority · Best use · Trust · Freshness · Access · Why it matters

Bitter Mill local inference traces

first party local inference engine

highest · benchmark canon · local inference trace import · High · High · Login needed · Treat as a top-priority R&D hygiene surface. For new frontier models, run `npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/doctor-latest.json` first, then run `npm run bench:hygiene:session` with `--clean-recording` only if the doctor exits zero. After export, run `npm run bench:bitter:discover` against repo-local exports or pass `--input <export-path>` for a known Bitter Mill export before external benchmark imports. Normalize only measured runs with complete memory evidence, healthy pressure, no thermal/performance warning, a passing final snapshot, and zero swap-I/O deltas into canonical data; historical swap allocation within the clean-recording threshold is warning-only.
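The "zero swap-I/O deltas" condition above is naturally a delta over the monitor's samples rather than a single reading, which is also how pre-existing swap allocation can stay warning-only. A sketch with assumed sample fields (`swap_in_pages` / `swap_out_pages` as cumulative counters; the real sidecar schema may differ):

```python
def swap_io_delta_pages(samples: list[dict]) -> int:
    """New swap I/O during the recording window, from cumulative page counters."""
    first, last = samples[0], samples[-1]
    delta_in = last["swap_in_pages"] - first["swap_in_pages"]
    delta_out = last["swap_out_pages"] - first["swap_out_pages"]
    return delta_in + delta_out

def clean_recording_window(samples: list[dict]) -> bool:
    # Canonical promotion requires zero swap I/O pages across the window;
    # nonzero swap *allocated* before the window started is warning-only.
    return swap_io_delta_pages(samples) == 0
```

Under this reading, a machine that booted with some swap in use can still produce a canonical run, as long as the counters do not move while the model is running.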

Silicon Score Lab

first party lab

highest · benchmark canon · reproduction · High · Medium · Direct fetch · Use as the highest-confidence anchor and as the source that upgrades important external claims from directional to measured.

Benchmark reference

Where comparable benchmark anchors and runtime-change signals usually appear first.

Source · Priority · Best use · Trust · Freshness · Access · Why it matters

Anubis OSS Apple Silicon benchmarks

competitor benchmark tool

high · competitor freshness tracking · instrumentation reference · Medium · High · Partial fetch · Use as a first-party lab hygiene and UX reference for telemetry breadth, export shape, operator-facing benchmark history, reasoning-aware metrics, prefill/decode presentation, backend detection, and phys_footprint memory accounting. Treat leaderboard rows as discovery until raw exports, setup context, and Silicon Score hygiene sidecars are captured.

apple-silicon-llm-bench

independent benchmark publication

medium · runtime comparison · methodology examples · Medium · High · Direct fetch · Use to compare runtime methodology and identify setup-sensitive rows worth reproducing on Silicon Score hardware. Its latest observed commit was April 26, 2026, but the dataset remains a single M3 Max 64 GB tier, so it should inform runtime and methodology work rather than broad Mac buying recommendations.

asiai Apple Silicon benchmarks

competitor benchmark tool

high · competitor freshness tracking · runtime comparison · Medium · High · Partial fetch · Use to track competitor capabilities and local-inference workflow expectations. Pay special attention to engine detection, memory isolation, thermal state, context allocation, GPU memory, and agent-readable status APIs. The May 5 refresh still shows asiai exposing MCP and REST paths for status, snapshots, metrics, model listing, engine detection, benchmarking, recommendations, diagnostics, and history, with local interfaces bound to 127.0.0.1 by default. The asiai docs treat nonzero swap as a warning band before a danger band; Silicon Score should keep the pragmatic clean-recording Bitter Mill sidecar policy for canonical first-party imports and should continue recording numeric swap-I/O, final snapshots, thermal state, context allocation, and process-hygiene blockers. Treat shared rows as prompts to improve Silicon Score's measured-data pipeline and UX, not as direct ranking truth.

Awni Hannun benchmark gists

maintainer benchmark gists

high · benchmark corroboration · mlx reference · High · Medium · Direct fetch · Check when MLX-related runtime or kernel changes may materially move Apple Silicon performance.

famstack.dev benchmark guides

independent benchmark publication

medium · benchmark corroboration · runtime comparison · Medium · High · Direct fetch · Use when an article publishes concrete Apple Silicon measurements with enough setup detail to support benchmark rows or practitioner-runtime notes.

Hugging Face model hub

model registry

high · model catalog updates · release tracking · High · High · Direct fetch · Use official publisher cards for model catalog currentness and local-serving notes. Use community quantization cards only as directional fit and reproduction signals unless hardware, runtime, context, memory, benchmark command shape, and source date are explicit; the April 2026 GLM-5.1 MLX quant cards show why these cards can be useful without becoming canonical benchmark truth. The May 2026 Mistral Small 4, Magistral Small, Llama 4 Scout, and Qwen3.5-397B MLX-card refreshes are setup-planning locators, not throughput sources, until a Mac hardware/runtime/speed trace is present.

LLMCheck Apple Silicon benchmarks

competitor benchmark site

high · competitor freshness tracking · coverage gap detection · Medium · High · Partial fetch · Use as a currentness and UX benchmark. Monitor the interactive benchmark page, open-data page, and JSON download with `research:compact --refresh` because public counts and reader-visible rows can diverge. The May 5 refresh still shows Updated April 2026 / 180+ Benchmarks and advertises a 50-model / 12-chip / 3-engine benchmark surface, while the refreshed reader view shows the table header followed by no visible rows. The open-data page and JSON endpoint remain the import-review surface with 122 downloadable rows; the JSON reports version 2026-04-17 and Q4_K_M / 256-input / 512-output / 3-run methodology, while row-level quantization fields also include Q8_0. The May 5 full `research:llmcheck:import --dry-run` review, preserved at `.local/llmcheck-imports/full-current-summary.json`, maps all 122 source rows, finds 0 missing mapped rows, reports no unsupported quantizations, and has no skipped source labels. The May 5 targeted frontier dry-run, preserved at `.local/llmcheck-imports/frontier-coverage-summary.json`, still maps 32 requested Qwen3.6, Gemma 4, Mistral Small 4, Llama 4 Scout, gpt-oss 120B, and Nemotron Cascade 2 rows while explicitly reporting no source rows for Devstral Small 2, GLM-4.5-Air, GLM-5.1, Magistral Small, or Qwen3.6-27B. Gemma 4 E2B/E4B, Mistral Small 4, Ministral 3 8B/14B, Nemotron Cascade 2, Llama 4 Scout, gpt-oss 120B, Qwen3.5-9B, Qwen 3 8B, Qwen 3 14B, Phi-4 Mini, Gemma 3 4B, Gemma 3 12B, DeepSeek R1 Distill Llama 8B, Llama 3.1 8B, and Mistral 7B v0.3 rows are official-source verified and may enter Silicon Score as trusted-reference current-market or baseline evidence from the downloadable JSON, but they still need first-party Bitter Mill reproduction with clean-recording sidecars before factory-measured promotion or ranking-changing conclusions.
DeepSeek R1 8B and 70B are normalized to DeepSeek-R1-Distill-Llama checkpoints; LLMCheck's generic Ministral 8B/14B labels are normalized to Mistral's Dec. 2025 Ministral 3 8B/14B because the source date and paired 14B rows point to the current family. Qwen 2.5 72B, Qwen 3 14B, Phi-4 Mini, Gemma 3 12B, and DeepSeek R1 Distill Llama 8B are historical/baseline unless current evidence shows practical competitiveness against Qwen3.6-class or Gemma 4 compact models.
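The dry-run reviews described above amount to a mapping audit over the downloadable rows. A hypothetical sketch of that audit (the row fields, model map, and quantization allowlist are illustrative, not LLMCheck's or the importer's actual schema):

```python
def dry_run_summary(source_rows: list[dict], model_map: dict,
                    supported_quants: set) -> dict:
    """Count how downloadable rows classify, mirroring a dry-run style report."""
    mapped = unsupported = unmapped = 0
    for row in source_rows:
        if row["quant"] not in supported_quants:
            unsupported += 1          # e.g. an unrecognized quantization label
        elif row["model"] not in model_map:
            unmapped += 1             # source label with no catalog mapping
        else:
            mapped += 1
    return {
        "source_rows": len(source_rows),
        "mapped": mapped,
        "unsupported_quant": unsupported,
        "unmapped_model": unmapped,
    }
```

A fully clean review, like the one preserved for the 122-row download, is the case where every row lands in `mapped` and both failure counters are zero.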

mac-llm-bench

independent benchmark publication

medium · competitor freshness tracking · methodology examples · Medium · High · Direct fetch · Use as a competitor/currentness and methodology source for Mac-specific benchmark tooling, especially base M5 and small frontier models. The May 5 refresh found generated M5 base speed tables covering GGUF Q4_K_M and MLX 4-bit rows, HumanEval+ tables using EvalPlus on 164 problems with the expanded test suite, README methodology notes for llama.cpp build 8680, mlx-lm 0.31.2, EvalPlus 0.3.1, and 4096-token context windows, with 8192-token windows for reasoning models run with --no-think. Treat rows as directional until exact raw JSON, machine tier, runtime, quantization, library versions, and Silicon Score zero-swap reproduction gates are reviewed; do not let HumanEval+ or single-quant rows change recommendations without corroboration. The refreshed registry evidence covered Qwen3.6-35B-A3B and Gemma 4 candidates but did not show Mistral Small 4, Magistral Small, GLM, MiniMax, or Qwen3.6-27B entries.

mlx-lm pull requests

maintainer pull requests

high · runtime change detection · benchmark corroboration · Medium · High · Direct fetch · Use as an early benchmark and runtime-change source for MLX, then corroborate important results before treating them as strong product truth.

oMLX community benchmarks

competitor benchmark site

highest · competitor freshness tracking · agentic runtime comparison · Medium · High · Partial fetch · Monitor as a premier high-churn frontier signal source using `research:compact --refresh`. Current reads show the public count moving from 165,714 to 171,890 rows across the May 4-5 refresh window, with same-day Gemma 4, Qwen3.6, Qwen3.5, gpt-oss-20b, and Nemotron entries on M1 Pro, M1 Max, M2 Max, M3 Max, M4, M4 Pro, M4 Max, M5, M5 Pro, M1 Ultra, and M3 Ultra systems plus row-level GLM-5.1 extreme-memory evidence on M3 Ultra 512 GB. Current first-page churn includes base-M4 Gemma 4 E4B-oQ6 6-bit context rows and base Qwen3.6-35B-A3B 6-bit rows on M3 Max 64 GB, in addition to derivative Qwen3.6 rows. Exact Qwen3.6-35B-A3B-UD M5 32 GB 3-bit long-context rows, Qwen3.6-35B-A3B-oQ4 M5 Max 128 GB, Qwen3.6-35B-A3B M1 Max 64 GB, Qwen3.6-27B M1 Max 32 GB, Gemma 4 31B M5 Max 128 GB, Gemma 4 26B-A4B M5 Pro 64 GB, Gemma 4 26B-A4B M1 Max 64 GB, and Gemma 4 E4B M3 Max 36 GB detail pages remain useful row-level setup evidence with context rows and batching data where available. The filtered Qwen3.6-27B view adds M4 Max 64 GB oQ4 rows across 1k, 4k, 8k, and 16k context, while the previous first-page read included Qwen3.5-122B-A10B-Text-qx85 8-bit M3 Max 128 GB rows and page 2 added Qwen3.5-122B-A10B-Text-qx64-hi 6-bit rows. Exact pages expose PP/TG throughput, TTFT, peak memory, macOS, and oMLX version fields. Earlier first-page reads showed M3 Ultra Gemma 4 31B 8-bit rows, M4 Max 128 GB Qwen3.6-35B-A3B-oQ5e long-context rows, Nemotron 3 Nano 30B-class sweeps, Gemma 4 26B-A4B MXFP4 rows, Qwen3.6-35B-A3B rows, and derivative or fine-tuned Qwen3.6 labels on M5 Max, M4 Max, M4 Pro, M3 Max, and M2 Max systems. Row-level detail matters: context coverage, TTFT, peak memory, macOS, oMLX version, quantization, and lineage can change the interpretation of headline TG tok/s.
A current MiniMax-M2.7-ConfigI M5 Max 128 GB 4-bit detail page shows fit and context behavior that justifies first-party reproduction but should stay out of canonical benchmark rows until Bitter Mill or trusted-reference evidence verifies setup lineage. The late May 4 Mistral Small 4 refresh found no row-level oMLX result for that exact model; LLMCheck and SharpAI HomeSec-Bench now provide structured Apple Silicon speed signals, while oMLX remains a gap until a row-level result or Bitter Mill trace appears. Silicon Score should use oMLX to discover high-interest Apple Silicon setups quickly while keeping community rows directional, derivative/fine-tune lineage-gated, and date-stamped as source observations. Queue first-party reproduction for any oMLX row that would change rankings, especially Qwen3.5/Qwen3.6, Gemma 4, MiniMax M2.7, GLM-5.1, Mistral Small 4, long-context, 8-bit quantization, and agent-cache cases.

oMLX repository

runtime repo

medium · runtime capability tracking · apple silicon runtime changes · Medium · High · Direct fetch · Monitor for Apple Silicon caching, serving, and long-context changes that could materially shift agentic coding viability on Macs.

SharpAI HomeSec-Bench

published benchmark page

medium · benchmark corroboration · apple silicon laptop coverage · Medium · High · Direct fetch · Use as a structured external reference for M5 Pro laptop coverage and for cross-checking frontier local-agent claims before lab reproduction.

vLLM-MLX repository

runtime repo

medium · runtime capability tracking · apple silicon runtime changes · Medium · High · Direct fetch · Use for runtime roadmap and capability-shift detection, then validate important claims against Silicon Score’s decision lenses.

Zach Rattner benchmark gists

community benchmark gists

medium · benchmark corroboration · coverage expansion · Medium · Medium · Direct fetch · Use as structured corroboration when the gist includes raw benchmark tables and an explicit machine description.

Practitioner

Where operators reveal workflow reality and caveats before the benchmark layer catches up.

Source · Priority · Best use · Trust · Freshness · Access · Why it matters

llama.cpp GitHub discussions

maintainer discussion

high · baseline reference · runtime change detection · Medium · High · Direct fetch · Treat as an early-warning and context source. Promote specific posts into canonical evidence only after methodology review.

Reddit /r/LocalLLaMA

operator forum

high · practitioner signals · frontier watchlist · Medium · High · Partial fetch · Use for directional practitioner signals and frontier prioritization, not as standalone performance truth.

Discovery

Where release movement starts, but not where performance truth should harden.

Source · Priority · Best use · Trust · Freshness · Access · Why it matters

LocalScore accelerator runs

community benchmark aggregator

medium · coverage discovery · gap detection · Low · High · Partial fetch · Use as broad discovery and historical coverage rather than a primary Apple Silicon frontier signal source. Prefer direct result or accelerator URLs over aggregate rankings, and upgrade important rows via corroboration or first-party reproduction.

X local-AI chatter

social discovery

medium · early discovery · maintainer watch · Low · High · Manual only · Treat as discovery only. Do not treat X posts as canonical evidence without a corroborating source that can be fetched, archived, and cited.