Best local LLMs for this Mac

28 models · Catalog current through February 27, 2026

Mac Studio M4 Max 64GB, ranked for coding with a bias toward the most capable models, using the best available runtime evidence.

Each row uses the strongest current runtime evidence. M5 Max watch: 8B at 61.6 tok/s · 14B at 34.3 tok/s on the current 36GB public anchor.
Rank | Model | Score | Quant | Tok/s | Runtime | Evidence | Headroom
1 | Qwen 3 32B | 274 | 8bit | 22.0 | Ollama | Estimated | 31.0 GB
2 | Devstral Small 1.1 | 262 | 8bit | 33.0 | LM Studio | Estimated | 39.9 GB
3 | Devstral Small 2 24B | 262 | 8bit | 23.4 | llama.cpp | Estimated | 39.9 GB
4 | Qwen3.5-27B | 260 | 8bit | 16.1 | llama.cpp | Estimated | 36.4 GB
5 | Gemma 3 27B | 257 | 8bit | 13.0 | LM Studio | Estimated | 34.1 GB
6 | DeepSeek R1 Distill Qwen 32B | 252 | 8bit | Measure it | Best available | Fit-first | 31.0 GB
7 | Mistral Small 3.1 24B | 240 | 8bit | Measure it | Best available | Fit-first | 39.9 GB
8 | Magistral Small | 240 | 8bit | Measure it | Best available | Fit-first | 39.9 GB
9 | Llama 3.3 70B | 239 | 6bit | 8.2 | LM Studio | Estimated | 11.7 GB
10 | Nemotron-3-Nano-30B-A3B | 233 | 8bit | 43.7 | llama.cpp | Estimated | 35.2 GB
11 | Qwen 3 30B-A3B | 233 | 8bit | 84.9 | MLX | Estimated | 34.3 GB
12 | Qwen3-Coder-30B-A3B | 233 | 8bit | 58.5 | llama.cpp | Estimated | 34.7 GB
13 | GLM-4.7-Flash | 232 | 8bit | 58.0 | llama.cpp | Estimated | 28.2 GB
14 | Qwen3.5-35B-A3B | 232 | 8bit | 57.6 | MLX | Estimated | 30.3 GB
15 | Qwen 3 8B | 211 | 8bit | 63.1 | LM Studio | Estimated | 54.7 GB
16 | Llama 2 7B | 209 | 8bit | 36.4 | llama.cpp | Estimated | 53.2 GB
17 | Qwen3.5-9B | 206 | 8bit | 15.0 | llama.cpp | Estimated | 54.1 GB
18 | Qwen 2.5 14B | 199 | 8bit | Measure it | Best available | Fit-first | 49.0 GB
19 | Phi-4 14B | 198 | 8bit | Measure it | Best available | Fit-first | 49.5 GB
20 | Llama 3.1 8B | 189 | 8bit | Measure it | Best available | Fit-first | 55.0 GB
21 | Qwen 2.5 7B | 189 | 8bit | Measure it | Best available | Fit-first | 56.0 GB
22 | Mistral 7B v0.3 | 188 | 8bit | Measure it | Best available | Fit-first | 55.7 GB
23 | GLM-4.5-Air | 158 | MXFP4 | 18.0 | LM Studio | Estimated | 9.6 GB
24 | Qwen3-Coder-Next | 156 | Q5_K_M | 74.0 | MLX | Estimated | 9.8 GB
25 | Gemma 3 4B | 150 | 8bit | 100.5 | LM Studio | Estimated | 58.4 GB
26 | Qwen 3 4B | 150 | 8bit | 143.2 | MLX | Estimated | 58.6 GB
27 | Qwen 3 0.6B | 145 | 8bit | 370.0 | LM Studio | Estimated | 62.8 GB
28 | Llama 3.2 1B | 124 | 8bit | Measure it | Best available | Fit-first | 62.1 GB

How to read each row: the listed quant is the highest practical quality that fits this machine, and the headroom figure is the memory left over for context after the weights load, so every row keeps a workable context margin. Estimated speeds are extrapolated from nearby benchmark coverage, with the listed runtime as the best available hint (Ollama is a wrapper on llama.cpp, LM Studio wraps mixed backends, and llama.cpp and MLX are native backends). Fit-first rows fit at the listed quant but still need a direct speed measurement.
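The quickest way to turn a Fit-first row into a measured one is to time a generation yourself. Below is a minimal sketch against Ollama's local HTTP API, which reports token counts and nanosecond durations in its non-streaming response; the model tag and prompt are placeholder assumptions, so substitute the row you want to test at the quant the table lists.

```python
# Minimal tok/s measurement sketch, assuming a local Ollama server on its
# default port and a model you have already pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:14b-instruct-q8_0"  # hypothetical tag; use the quant from the table

payload = json.dumps({
    "model": MODEL,
    "prompt": "Write a Python function that merges two sorted lists.",
    "stream": False,  # return one JSON object with the timing counters
}).encode()

req = urllib.request.Request(OLLAMA_URL, data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# Ollama reports durations in nanoseconds alongside token counts.
prompt_tps = body["prompt_eval_count"] / (body["prompt_eval_duration"] / 1e9)
gen_tps = body["eval_count"] / (body["eval_duration"] / 1e9)
print(f"prompt processing: {prompt_tps:.1f} tok/s")
print(f"generation:        {gen_tps:.1f} tok/s")
```

Run it two or three times and discard the first pass, since model load and cache warm-up drag the cold run down.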

Search-led shortcuts beyond the ranking table

These links target the broad buyer queries already clustering around 16GB Macs, M5 Pro MacBook Pros, and the general "best Mac for local LLMs" search.

Best Mac for local LLMs in 2026

Mac mini M4 Pro 48GB is the strongest broad starting answer right now, while MacBook Pro M4 Pro 48GB 14-inch is the portable step-up. Use the guide when your real query is broader than a single Mac page.

Best local LLMs for Apple Silicon 16GB Macs

The current coding-biased answers point to Devstral Small 2 24B on both the MacBook Air M4 16GB and the Mac mini M4 16GB. Treat 16GB as the compact-model tier and verify exact fit before you buy.

M5 Pro MacBook Pro evidence

The current M5 Pro 64GB record covers 4 published rows across 4 tracked models, with the fastest published row at 41.9 tok/s on Qwen3.5-35B-A3B.

If the real decision is local Mac versus rented-GPU economics, compare the hardware path in Worth against AI Datacenter Index.

Frequently asked questions

Which Mac is best for running LLMs locally?
Mac mini M4 Pro 48GB is the strongest broad starting answer for value right now, while MacBook Pro M4 Pro 48GB 14-inch is the portable step-up when you need more headroom. The real winner still depends on the model class you need, your budget, and whether you care more about capability or responsiveness.
Can a 16GB Mac run useful local LLMs?
Yes. The current coding-biased answers point to Devstral Small 2 24B on both the MacBook Air M4 16GB and the Mac mini M4 16GB. Treat 16GB as the compact-model tier rather than a blanket promise for larger 27B-class daily drivers, and verify exact fit before you buy.
How much unified memory do you need for local LLMs on a Mac?
Unified memory determines which quantization tier fits cleanly, how much context you can keep live, and whether a recommendation stays practical after launch. In practice, 16GB can be useful for compact models, 24GB to 48GB opens more working room, and 64GB-plus tiers are where larger frontier setups become more credible. Use Fit to verify the exact model and headroom instead of relying on RAM labels alone.
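As a rough fit check you can approximate the weights as parameter count times bytes per weight; the headroom numbers in the ranking table line up with this rule of thumb within a gigabyte or two (Qwen 3 32B at 8-bit is about 32 GB of weights, leaving roughly 31 GB on a 64GB machine). A minimal sketch, with the caveat that KV cache, long contexts, and the OS all eat into the printed headroom:

```python
# Back-of-the-envelope fit math: weights ~= params * bits / 8.
# This ignores KV cache and OS overhead, which consume part of the headroom.
def weights_gb(params_b: float, bits: int) -> float:
    return params_b * bits / 8

def headroom_gb(ram_gb: float, params_b: float, bits: int) -> float:
    return ram_gb - weights_gb(params_b, bits)

for name, params_b, bits in [("Qwen 3 32B", 32, 8),
                             ("Llama 3.3 70B", 70, 6),
                             ("Qwen 3 4B", 4, 8)]:
    print(f"{name}: ~{weights_gb(params_b, bits):.1f} GB weights, "
          f"~{headroom_gb(64, params_b, bits):.1f} GB headroom on 64GB")
```
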
What does tok/s mean and how much do I need?
Tokens per second measures how fast a model generates text. Around 8 to 15 tok/s usually feels interactive, while heavier coding, agent, or batch workflows benefit from more. Silicon Score ranks Macs with measured or explicitly labeled estimated speed so you can trade responsiveness against model quality with the evidence visible.
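To make that concrete, here is the wait-time arithmetic for a single answer at speeds taken from the table above; the 400-token answer length is an assumed typical size, not a benchmark.

```python
# Wait time for one answer is just tokens divided by generation speed.
answer_tokens = 400  # assumed typical answer length
for tps in (8.2, 22.0, 63.1, 143.2):  # speeds drawn from the ranking table
    print(f"{tps:>6.1f} tok/s -> {answer_tokens / tps:5.1f} s per {answer_tokens}-token answer")
```
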
Is local LLM inference on a Mac cheaper than using an API?
If you run daily, want predictable privacy, or repeatedly use the same model class, local hardware can make sense. If your usage is spiky or you expect to burst into much larger models, compare the Mac path against rented GPU economics on AI Datacenter Index before you over-buy hardware.
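A quick way to frame that decision is break-even arithmetic. Every number in this sketch is an assumption to replace with your own hardware quote, current API pricing, and an honest usage estimate:

```python
# Hypothetical break-even sketch; all inputs are assumptions, not quotes.
hardware_cost = 3499.0        # assumed Mac price in USD
api_price_per_mtok = 3.0      # assumed blended $/million tokens
monthly_tokens_m = 30.0       # assumed usage, millions of tokens per month

monthly_api_cost = api_price_per_mtok * monthly_tokens_m
months_to_break_even = hardware_cost / monthly_api_cost
print(f"~{months_to_break_even:.0f} months to break even at this usage")
```

This ignores electricity, resale value, and the capability gap between local models and hosted frontier models, so treat it as a floor on the comparison rather than a verdict.
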
What quantization format should I use on a Mac?
GGUF remains a common path for llama.cpp and Ollama on Macs, but there is no single best quantization for every machine. The practical answer depends on memory ceiling, runtime support, and how much quality loss you can tolerate. Start from the quantization and runtime surfaced in Rankings and Bench instead of assuming one preset wins everywhere.
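One practical step before downloading anything is to see which quantization tiers a repo actually ships. Here is a minimal sketch using the huggingface_hub client; the repo id is a hypothetical example, not a recommendation:

```python
# pip install huggingface_hub
# List the GGUF quant files a model repo publishes before committing to one.
from huggingface_hub import list_repo_files

repo_id = "bartowski/Mistral-Small-3.1-24B-Instruct-2503-GGUF"  # hypothetical example
gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
for f in sorted(gguf_files):
    print(f)
```

From the resulting list, pick the largest quant that still leaves the kind of headroom the Rankings table shows for your machine.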