
Research & Reasoning

Research tasks benefit from larger models — better reasoning, more nuanced outputs. This means 32B+ models where quality matters more than raw speed. Here is the hardware data.

Typical model size: 30B–70B
Recommended RAM: 48–128 GB
Key models: Qwen 3 32B, Qwen 3 30B
Benchmark rows: 10

Why these models for this use case

Research and reasoning tasks need model quality that only comes at 32B+ parameter counts. Qwen 3 32B at Q4_K_M runs at ~22 tok/s on M4 Max 64 GB — fast enough for interactive research, and the quality gap versus 7B is dramatic. For frontier-scale local inference (Qwen 3 235B A22B at 8 tok/s on M4 Max 128 GB), you need 128 GB+ unified memory. Speed is acceptable for batch tasks even if it feels slow for real-time chat.
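
As a rough sanity check on the RAM requirements quoted here and in the table below, the sketch that follows estimates a quantized model's memory footprint from parameter count and quant level. The bits-per-weight averages and the flat overhead allowance are assumptions (exact file sizes vary by architecture and quant mix), not measured values.

```python
# Rough memory estimate for a quantized model: weights plus headroom for
# KV cache and runtime buffers. Bits-per-weight figures are approximate
# averages for llama.cpp-style quants (assumption, not measured).
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def estimate_ram_gb(params_billions: float, quant: str, overhead_gb: float = 2.0) -> float:
    """Weight size in GB plus a flat allowance for KV cache and buffers."""
    weight_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return weight_gb + overhead_gb

# Qwen 3 32B at Q4_K_M: ~19 GB of weights, ~21 GB with headroom,
# in line with the ~20 GB RAM requirement listed in the table below.
print(f"{estimate_ram_gb(32, 'Q4_K_M'):.1f} GB")
```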

Benchmark results — fastest rows first

Filtered to models commonly used for research & reasoning. Sorted by avg tok/s descending.

| Chip | Model | Quant | RAM req. | Avg tok/s | Runtime | Source |
|---|---|---|---|---|---|---|
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q4 | 16.12 GB | 92.1 | MLX | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q5 | 18.09 GB | 84.9 | MLX | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q6 | 21.87 GB | 76.7 | MLX | ref |
| M4 Max (128 GB) | Qwen 3 30B A3B | Q4_K_M | — | 70.2 | LM Studio | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q8 | 29.78 GB | 52.6 | MLX | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 32B | Q4_K_M | 20 GB | 22.0 | factory harness | factory lab |
| M4 Max (128 GB) | Gemma 3 27B | Q8_0 | — | 14.5 | LM Studio | ref |
| M4 Max (32-core GPU) | Qwen 3 32B | iQ2_K_S | 11 GB | 13.2 | — | ref |
| M4 Max (128 GB) | Qwen 3 235B A22B | Q4_K_M | — | 8.1 | LM Studio | ref |
| M4 Max (24-core GPU) | Llama 3.3 70B | Q5_K_M | 50 GB | 7.1 | — | ref |
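
To put the tok/s column in context for research workloads, here is a quick conversion from generation speed to wall-clock time for a long answer. Prompt-processing time is ignored, which is an assumption; real end-to-end latency will be somewhat higher.

```python
# Wall-clock generation time for a research-length answer at a given speed.
def generation_minutes(output_tokens: int, tok_per_s: float) -> float:
    return output_tokens / tok_per_s / 60

# A 2,000-token answer: ~1.5 min at 22 tok/s (Qwen 3 32B Q4_K_M),
# ~4 min at 8.1 tok/s (Qwen 3 235B A22B Q4_K_M).
for label, speed in [("Qwen 3 32B @ 22.0 tok/s", 22.0),
                     ("Qwen 3 235B A22B @ 8.1 tok/s", 8.1)]:
    print(f"{label}: {generation_minutes(2000, speed):.1f} min for 2,000 tokens")
```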



benchmarks.json — full dataset  ·  models.json — model summaries  ·  benchmarks.csv
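
If you want to reproduce the table above from the raw data, a minimal sketch is below. The field names (`model`, `chip`, `avg_tok_s`) are assumptions; check the actual schema in benchmarks.json before relying on them.

```python
# Filter benchmarks.json to research-scale models and sort by avg tok/s,
# mirroring the table above. Field names are assumed, not confirmed.
import json

with open("benchmarks.json") as f:
    rows = json.load(f)

research_models = ("Qwen 3 32B", "Qwen 3 30B", "Qwen 3 235B",
                   "Gemma 3 27B", "Llama 3.3 70B")
filtered = [r for r in rows if any(m in r["model"] for m in research_models)]

for r in sorted(filtered, key=lambda r: r["avg_tok_s"], reverse=True):
    print(r["chip"], r["model"], r["avg_tok_s"])
```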

Buying guide: best Mac for local LLMs →