Run Bitter Mill current-frontier batch on the owned M5 Max
Qwen3.6, MiniMax M2.7, Gemma 4 including E2B/E4B, Mistral Small 4, Ministral 3 compact models, Llama 4 Scout, gpt-oss 120B/20B, Nemotron Cascade 2, GLM-4.5-Air, Magistral Small, Devstral Small 2, and current Qwen coding/dense anchors should be reproduced from first-hand Bitter Mill runtime, quantization, and context sweeps before they become first-party measured evidence.
22 models · 1 source
Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.
Runbook
data/ops/bitter-mill/current-frontier-m5-max-2026-05.json
Next command
npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/current-frontier-m5-max-2026-05.doctor.json
Monitor command
npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-current-frontier-m5-max-2026-05 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/current-frontier-m5-max-2026-05/system-hygiene.json
Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling
Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The all-runbooks readiness audit checks every Bitter Mill runbook with one captured machine and clean-recording system-hygiene snapshot before any runbook is opened.
first party bitter mill batch
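The hygiene gate above separates hard blockers (swap I/O, memory pressure, stale inference processes, throttling) from the warning-only case of pre-existing swap. A minimal sketch of that split follows. Every field name here is a hypothetical stand-in, since the actual schema of the doctor JSON is not shown in this queue.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class HygieneSnapshot:
    """Hypothetical shape for one doctor or monitor snapshot (not the real schema)."""
    swap_io_pages: Optional[int]       # None means memory metrics were not captured
    preexisting_swap_mb: float         # swap already allocated before the recording window
    excessive_swap_allocation: bool
    memory_pressure: bool
    thermal_or_perf_throttling: bool
    stale_inference_processes: List[str] = field(default_factory=list)


def evaluate_gate(snap: HygieneSnapshot) -> Tuple[List[str], List[str]]:
    """Return (blockers, warnings); any blocker means stop before inference."""
    blockers: List[str] = []
    warnings: List[str] = []

    if snap.swap_io_pages is None:
        blockers.append("memory metrics missing")
    elif snap.swap_io_pages != 0:
        blockers.append(f"{snap.swap_io_pages} swap I/O pages during recording")

    if snap.excessive_swap_allocation:
        blockers.append("excessive swap allocation")
    elif snap.preexisting_swap_mb > 0:
        # Pre-existing swap is warning-only under the gate text above.
        warnings.append("pre-existing swap present (warning only)")

    if snap.memory_pressure:
        blockers.append("memory pressure")
    if snap.thermal_or_perf_throttling:
        blockers.append("thermal/performance throttling")
    if snap.stale_inference_processes:
        names = ", ".join(snap.stale_inference_processes)
        blockers.append(f"stale local inference processes: {names}")

    return blockers, warnings
```

A run would proceed only when the blockers list is empty; warnings are recorded but do not stop the batch.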
Run planned M4 Ultra 256GB frontier arrival batch
When the 256GB Mac Studio arrives, run a clean-recording Bitter Mill sweep across the active M5 current-frontier target set including MiniMax M2.7 plus high-memory and fit-boundary extras, runtimes, quantizations including current MLX dynamic low-bit profiles, and context ladders before changing Silicon Score's featured environment.
26 models · 1 source
Status: Blocked until arrival · M4 Ultra 256 GB stays queued until the machine is physically present, hardware identity is verified, and clean-recording preflight passes.
Runbook
data/ops/bitter-mill/planned/m4-ultra-256gb-frontier-2026-06.json
Next command
npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/m4-ultra-256gb-frontier-2026-06.doctor.json
Monitor command
npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-m4-ultra-256gb-frontier-2026-06 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/m4-ultra-256gb-frontier-2026-06/system-hygiene.json
Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling
Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened on the arrived Mac Studio. / The active all-runbooks readiness audit passes for checked-in active runbooks, then the planned M4 Ultra runbook's explicit check passes on the arrived Mac Studio.
first party bitter mill batch
Prepare M4 Ultra 256GB Mac Studio as the next featured lab environment
In June 2026 the local lab is expected to receive a 256GB Mac Studio provisionally labeled M4 Ultra. It replaces the M5 Max 128GB MacBook Pro as the featured environment only after arrival, hardware detection, and clean first-party evidence justify changing the public default.
1 source
Status: Blocked until arrival · M4 Ultra 256 GB stays queued until the machine is physically present, hardware identity is verified, and clean-recording preflight passes.
Next command
npm run validate:lab-environments
Hygiene gate: Clean recording · pre-existing swap warning-only · passing final snapshot · 0 swap I/O pages · no thermal/performance throttling
Gate: data/ops/lab-environments.json keeps the M5 Max 128GB MacBook Pro as current active featured environment until the M4 Ultra 256GB Mac Studio is physically present. / Arrival capture records system_profiler chip name, GPU cores, model identifier, and 256GB unified memory before any featured_local_benchmark or FEATURED_MACHINE_ID change.
lab environment transition
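The arrival capture described above records chip name, GPU cores, model identifier, and unified memory before the featured environment changes. The sketch below shows one way to pull those four values out of the JSON that `system_profiler -json SPHardwareDataType SPDisplaysDataType` prints on macOS. The key names (`chip_type`, `machine_model`, `physical_memory`, `sppci_cores`) are assumptions based on typical Apple Silicon output and should be checked against the arrived machine; the function names and the sample values are illustrative only.

```python
from typing import List, Optional


def summarize_hardware(profile: dict) -> dict:
    """Extract the four arrival-capture values from parsed system_profiler JSON.

    Key names are assumed from typical Apple Silicon output, not confirmed here.
    """
    hardware = (profile.get("SPHardwareDataType") or [{}])[0]
    displays = profile.get("SPDisplaysDataType") or []
    gpu_cores: Optional[str] = next(
        (d["sppci_cores"] for d in displays if d.get("sppci_cores")), None
    )
    return {
        "chip_name": hardware.get("chip_type"),
        "model_identifier": hardware.get("machine_model"),
        "unified_memory": hardware.get("physical_memory"),
        "gpu_cores": gpu_cores,
    }


def arrival_blockers(summary: dict, expected_memory: str = "256 GB") -> List[str]:
    """List reasons the featured-environment change should stay blocked."""
    blockers = [f"missing {key}" for key, value in summary.items() if not value]
    memory = summary.get("unified_memory")
    if memory and memory != expected_memory:
        blockers.append(f"unified memory is {memory}, expected {expected_memory}")
    return blockers


# Illustrative input only; these are placeholder values, not a real capture.
sample_profile = {
    "SPHardwareDataType": [
        {"chip_type": "Example Chip", "machine_model": "Example1,1",
         "physical_memory": "256 GB"}
    ],
    "SPDisplaysDataType": [{"sppci_cores": "80"}],
}
sample_summary = summarize_hardware(sample_profile)
```

Because the machine is only provisionally labeled M4 Ultra, the sketch records whatever chip name is detected instead of asserting one.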
Upgrade Qwen3.5-122B-A10B reference rows with Bitter Mill runs
Qwen3.5-122B-A10B is a high-memory frontier MoE candidate with trusted-reference rows and field speed reports, and this queued Bitter Mill import will test whether it belongs in high-end Mac advice.
1 model · 1 source
Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.
Next command
npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/import-bitter-mill-qwen-3-5-122b-a10b.doctor.json
Monitor command
npm run bench:hygiene:session -- monitor --clean-recording --label import-bitter-mill-qwen-3-5-122b-a10b --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/qwen-3-5-122b-a10b.system-hygiene.json
Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling
Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The hygiene monitor starts before Bitter Mill is opened and stops only after the model export lands; a same-base monitor hygiene sidecar exists beside the Bitter Mill export.
first party bitter mill import
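The import gate above requires a same-base hygiene sidecar beside each Bitter Mill export. A minimal sketch of that check follows. It assumes the export is named `<base>.json` and the sidecar `<base>.system-hygiene.json`; the sidecar suffix matches the monitor command's `--output` path, but the export naming is a guess, and both function names are hypothetical.

```python
from pathlib import Path
from typing import List

SIDECAR_SUFFIX = ".system-hygiene.json"


def sidecar_for(export_path: Path) -> Path:
    """Map '<base>.json' to '<base>.system-hygiene.json' in the same directory."""
    return export_path.with_name(export_path.stem + SIDECAR_SUFFIX)


def exports_missing_sidecars(incoming: Path) -> List[str]:
    """Return export file names in `incoming` that have no same-base sidecar."""
    missing: List[str] = []
    for path in sorted(incoming.glob("*.json")):
        if path.name.endswith(SIDECAR_SUFFIX):
            continue  # this file is itself a sidecar, not an export
        if not sidecar_for(path).exists():
            missing.append(path.name)
    return missing
```

Calling `exports_missing_sidecars(Path("data/ops/bitter-mill/incoming"))` before import review would list any export that landed without its monitor record.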
Upgrade Qwen3.5-397B-A17B reference rows with Bitter Mill runs
Qwen3.5-397B-A17B is the current high-memory Qwen3.5 flagship with sparse Apple Silicon reference evidence, and this queued Bitter Mill import keeps it behind clean first-party reproduction.
1 model · 1 source
Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.
Next command
npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/import-bitter-mill-qwen-3-5-397b-a17b.doctor.json
Monitor command
npm run bench:hygiene:session -- monitor --clean-recording --label import-bitter-mill-qwen-3-5-397b-a17b --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/qwen-3-5-397b-a17b.system-hygiene.json
Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling
Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The hygiene monitor starts before Bitter Mill is opened and stops only after the model export lands; a same-base monitor hygiene sidecar exists beside the Bitter Mill export.
first party bitter mill import
Run Bitter Mill current-reference batch on the owned M5 Max
Qwen 3 32B, Qwen 3 235B-A22B, Devstral Small 1.1, Qwen 3 8B, Nemotron-3-Nano, GLM-4.7-Flash, Phi-4 14B, Mistral Small 3.1, DeepSeek R1 distills, and Qwen 2.5 72B have Apple Silicon evidence or practitioner signals but should be treated as reference/baseline reproduction work rather than the current frontier lane. Qwen3.5-35B-A3B stays in the current-frontier M5 batch as the practical Qwen3.6 displacement comparator.
12 models · 1 source
Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.
Runbook
data/ops/bitter-mill/current-reference-m5-max-2026-05.json
Next command
npm run bench:hygiene:doctor -- --clean-recording --output .local/benchmark-hygiene/current-reference-m5-max-2026-05.doctor.json
Monitor command
npm run bench:hygiene:session -- monitor --clean-recording --label bitter-mill-current-reference-m5-max-2026-05 --sample-interval-ms 15000 --output data/ops/bitter-mill/incoming/current-reference-m5-max-2026-05/system-hygiene.json
Hygiene gate: Clean recording · clean intake root · memory metrics present · pre-existing swap warning-only · passing final snapshot · passing monitor samples · 0 swap I/O pages · no thermal/performance throttling
Gate: The hygiene doctor reports clean recording readiness before Bitter Mill is opened; if it reports excessive swap allocation, memory pressure, stale local inference processes, or thermal/performance warnings, stop before any inference run. / The all-runbooks readiness audit checks every Bitter Mill runbook with one captured machine and clean-recording system-hygiene snapshot before any runbook is opened.
first party bitter mill batch
Run M5 Max 128 GB frontier anchors under clean-recording lab hygiene
Use the owned M5 Max 128 GB plan to convert frontier and coding-agent watchlist signals into first-party Silicon Score lab rows, but only when the recording window proves zero swap I/O and clean system pressure.
8 models · 1 source
Status: Ready after hygiene · Run the listed clean-recording gate before benchmark execution, discovery, import, or publication changes.
Next command
npm run bench:lab:m5:anchors:dry-run -- --preflight --clean-recording
Hygiene gate: Clean recording · memory metrics present · pre-existing swap warning-only · passing final snapshot · 0 swap I/O pages
Gate: Actual run must occur on a detected M5 Max 128 GB machine; runner mismatch warnings are blockers for canonical factory_measured promotion. / Output benchmark records resolve chip, RAM, model, runtime, quantization, source date, and generation tok/s before append review.
lab verification
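The lab-verification gate above has two parts: the runner must be the detected M5 Max 128 GB machine, and each output record must resolve seven fields before append review. The sketch below checks both. The record keys and the function name are hypothetical, since the benchmark record schema is not shown here.

```python
from typing import List

# Hypothetical keys for the seven fields the gate names.
REQUIRED_FIELDS = (
    "chip", "ram_gb", "model", "runtime",
    "quantization", "source_date", "generation_tok_s",
)


def review_record(
    record: dict,
    detected_chip: str,
    detected_ram_gb: int,
    target_chip: str = "M5 Max",
    target_ram_gb: int = 128,
) -> List[str]:
    """Return problems that block canonical promotion of one benchmark record."""
    problems: List[str] = []
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            problems.append(f"unresolved field: {name}")
    if (detected_chip, detected_ram_gb) != (target_chip, target_ram_gb):
        # Per the gate, a runner mismatch is a blocker, not a warning.
        problems.append(
            f"runner mismatch: detected {detected_chip} {detected_ram_gb} GB, "
            f"expected {target_chip} {target_ram_gb} GB"
        )
    return problems
```

An empty list means the record can move to append review; any entry keeps it out of `factory_measured` promotion.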
Monitor premier Apple Silicon benchmark sites for freshness gaps
Track LLMCheck, oMLX, asiai, Anubis OSS, apple-silicon-llm-bench, mac-llm-bench, and broad LocalScore discovery so Silicon Score notices frontier releases, benchmark hygiene patterns, and UX/evidence gaps before public rankings go stale.
7 sources
Next command
npm run research:compact -- read --url https://llmcheck.net/benchmarks --query "Qwen3.6 Gemma 4 Apple Silicon benchmark methodology raw measurements" --maxChars 2400 --refresh
Hygiene gate: Clean recording · pre-existing swap warning-only
Gate: Every candidate source fact is backed by a fetched URL or captured artifact. / Freshness-review reads use research:compact --refresh so currentness checks do not silently reuse stale cache entries.
competitor freshness review
Corroborate GLM-5.1 Apple Silicon viability
Official GLM-5.1 metadata and local-serving docs are captured, plus row-level oMLX and Hugging Face MLX quantization evidence on M3 Ultra 512GB; these are directional field signals until first-party reproduction captures clean-recording hygiene.
1 model · 7 signals · 5 sources
Next command
npm run research:compact -- read --url https://huggingface.co/zai-org/GLM-5.1 --query "local deployment quantized GGUF MLX KTransformers GLM-5.1 Apple Silicon" --maxChars 3000 --refresh
Hygiene gate: Clean recording · pre-existing swap warning-only
Gate: The official GLM-5.1 model metadata and release note are captured before public ranking copy treats it as current. / Any Apple Silicon row must explicitly identify GLM-5.1, hardware, runtime, quantization, context, and source date before it becomes a practitioner signal or benchmark candidate.
coverage expansion
Corroborate Mistral Small 4 Apple Silicon runtime
Mistral Small 4 has official local-serving support, Hugging Face MLX conversion cards, LLMCheck trusted-reference Apple Silicon rows, and SharpAI HomeSec-Bench M5 Pro 64GB llama.cpp field reports. The HomeSec rows make it a concrete reproduction candidate, but they remain directional community/operator evidence until first-party Bitter Mill captures setup, quantization, context, methodology, and hygiene sidecars.
1 model · 3 signals · 6 sources
Next command
npm run research:compact -- read --url https://www.sharpai.org/benchmark/ --query "Mistral-Small-4-119B Q2_K_XL UD-IQ1_M MacBook Pro M5 Pro 64GB llama.cpp tok/s TTFT HomeSec-Bench" --maxChars 5000 --refresh
Hygiene gate: Clean recording · pre-existing swap warning-only
Gate: Any Apple Silicon row must explicitly identify Mistral Small 4, hardware, runtime, quantization, context or prompt shape, reported speed, and source date before it becomes a practitioner signal or benchmark candidate. / The SharpAI HomeSec-Bench M5 Pro 64GB rows are curated directional community_operator_report signals; use them to design first-party reproduction, not as canonical benchmark rows.
coverage expansion
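The coverage-expansion gate above says a row counts as a practitioner signal only when it names every required attribute, and that community operator reports stay directional even when complete. A small sketch of that triage follows. The row keys, the tier labels other than `community_operator_report`, and the function names are all hypothetical.

```python
# Hypothetical keys; "context" and "prompt_shape" are alternatives per the gate.
ALWAYS_REQUIRED = (
    "model", "hardware", "runtime",
    "quantization", "reported_speed", "source_date",
)
ONE_OF = ("context", "prompt_shape")


def _present(value) -> bool:
    """Treat None and empty strings as unresolved; a numeric 0 still counts."""
    return value is not None and value != ""


def classify_row(row: dict) -> str:
    """Triage one external Apple Silicon row into a coarse evidence tier."""
    complete = all(_present(row.get(key)) for key in ALWAYS_REQUIRED)
    shaped = any(_present(row.get(key)) for key in ONE_OF)
    if not (complete and shaped):
        return "incomplete"          # not yet a signal or a candidate
    if row.get("evidence_kind") == "community_operator_report":
        return "directional_signal"  # informs reproduction, never canonical
    return "needs_review"            # other kinds need a provenance decision
```

The design choice is that completeness alone never promotes a row: it only makes the row eligible for the tier its provenance allows.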
Corroborate Magistral Small Apple Silicon runtime
Magistral Small has official 24B reasoning-model grounding, a fit note that the quantized model can run within a 32GB RAM MacBook, and Hugging Face MLX conversion cards, but the 2026-05-05 compact refresh found no row-level Apple Silicon throughput evidence in LLMCheck, oMLX, mac-llm-bench, broad search, LocalLLaMA search, or the MLX cards. The existing M5 Max Bitter Mill batch owns first-party measurement; this task keeps external/runtime corroboration explicit without promoting fit guidance into speed evidence.
1 model · 1 signal · 5 sources
Next command
npm run research:compact -- read --url https://huggingface.co/mistralai/Magistral-Small-2506 --query "Magistral Small 2506 local inference quantized MacBook 32GB context degradation GGUF MLX Apple Silicon speed" --maxChars 4200 --refresh
Hygiene gate: Clean recording · pre-existing swap warning-only
Gate: Any Apple Silicon row must explicitly identify Magistral Small 2506, hardware, runtime, quantization, context or prompt shape, reported speed, and source date before it becomes a practitioner signal or benchmark candidate. / The official Hugging Face card and Mistral release post ground currentness, self-deployment, and runtime planning only; they do not establish Apple Silicon throughput.
coverage expansion
Corroborate Llama 3.3 70B cross-tier coverage
Llama 3.3 70B now has Apple Silicon rows across M1, M2, M3, M4, and M5-era tiers, including trusted-reference LLMCheck anchors on M5 Max and M4 Ultra.
1 model · 4 sources
Next command
npm run research:compact -- read --url https://llmcheck.net/data/benchmarks.json --query "Llama 3.3 70B Apple Silicon chip ram engine tps date" --maxChars 2400 --refresh
Gate: Any candidate coverage row must explicitly identify Llama 3.3 70B, Apple Silicon hardware, RAM tier or clearly bounded chip tier, runtime, quantization, reported speed, and source date. / New community or competitor rows stay directional unless methodology and provenance are strong enough for trusted-reference treatment.
coverage expansion