Qwen 3 on Apple Silicon — Performance Guide
Qwen 3 is Alibaba's 2025 model family, and its standout trick is the 30B A3B MoE variant: 92 tok/s on an M4 Max — faster than Llama 3.1 8B. Here's what you need to know to pick the right Qwen 3 model for your Mac.
Qwen 3 family overview
Qwen 3 comes in two architectures: dense models (standard transformers) and MoE (mixture-of-experts) models. MoE models activate only a fraction of their parameters per token, which makes inference much faster at a comparable quality tier.
| Model | Type | Active params | RAM at Q4 | M4 Max 64 GB tok/s | Context |
|---|---|---|---|---|---|
| Qwen 3 0.6B | Dense | 0.6B | ~0.4 GB | — | 32K |
| Qwen 3 4B | Dense | 4B | ~2.5 GB | 148 tok/s (Q4) | 32K |
| Qwen 3 8B | Dense | 8B | ~5 GB | — | 128K |
| Qwen 3 30B A3B | MoE | 3B active / 30B total | ~16 GB | 92 tok/s (Q4) | 128K |
| Qwen 3 32B | Dense | 32B | ~20 GB | 22 tok/s (Q4_K_M) | 128K |
| Qwen 3 235B A22B | MoE | 22B active / 235B total | ~140 GB | — (needs 128 GB+) | 128K |
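The "RAM at Q4" column roughly follows a rule of thumb (an assumption, not part of the benchmark data): weight memory ≈ total parameters × bits per weight ÷ 8, plus roughly 10% overhead for activations and KV cache. Note that for MoE models, RAM scales with *total* parameters while speed scales with *active* parameters. A minimal sketch:

```python
def est_ram_gb(total_params_b: float, bits_per_weight: float,
               overhead: float = 0.10) -> float:
    """Rough RAM estimate for a quantized model: weight bytes plus
    ~10% runtime overhead (the 10% figure is an assumption)."""
    weights_gb = total_params_b * bits_per_weight / 8
    return weights_gb * (1 + overhead)

# Qwen 3 30B A3B at 4-bit: weights scale with TOTAL params (30B),
# even though only 3B are active per token.
print(round(est_ram_gb(30, 4), 1))  # 16.5 — close to the ~16 GB in the table
```

This also explains why the 30B A3B needs triple the RAM of an 8B dense model despite running faster: memory is paid for all 30B weights, compute only for the 3B active ones.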
The MoE advantage: Qwen 3 30B A3B
The 30B A3B model activates only 3B parameters per token despite having 30B total weights. At Q4 quantization, it fits in 16 GB RAM and runs faster than most 8B dense models.
| Model | RAM at Q4 | M4 Max 64 GB tok/s | Speed vs 8B Llama |
|---|---|---|---|
| Llama 3.1 8B (dense) | ~5 GB | 52.4 tok/s | — |
| Qwen 3 30B A3B (MoE) | ~16 GB | 92.1 tok/s | +76% |
| Qwen 3 32B (dense) | ~20 GB | 22.0 tok/s | −58% |
Qwen 3 30B A3B delivers roughly 30B-class quality at 8B-class speeds — or faster. If you're deciding between Llama 3.1 8B (dense) and Qwen 3 30B A3B (MoE), pick 30B A3B if you have 16+ GB free.
Qwen 3 30B A3B quantization ladder — M4 Max (40-core GPU, 64 GB)
| Quantization | RAM required | Avg tok/s | vs Q4 |
|---|---|---|---|
| Q4 | ~16 GB | 92.1 tok/s | — |
| Q5 | ~19 GB | 84.9 tok/s | −8% |
| Q6 | ~22 GB | 76.7 tok/s | −17% |
| Q8 | ~30 GB | 52.6 tok/s | −43% |
Q4 is the sweet spot for 30B A3B: moving to higher-precision quants costs significant speed while the quality gains on MoE models diminish. Source: MLX reference runs.
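The "vs Q4" column is just the relative throughput delta against the Q4 baseline; a quick sanity check:

```python
def pct_vs_baseline(tok_s: float, baseline: float) -> int:
    """Throughput delta relative to a baseline, as a rounded percent."""
    return round((tok_s / baseline - 1) * 100)

baseline = 92.1  # Q4 tok/s from the table
for quant, speed in [("Q5", 84.9), ("Q6", 76.7), ("Q8", 52.6)]:
    print(quant, pct_vs_baseline(speed, baseline))  # -8, -17, -43
```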
Which Qwen 3 model for which Mac?
16 GB RAM (M4, M5 base)
- Best pick: Qwen 3 4B (Q4/Q8) — 148 tok/s, fast, capable
- Qwen 3 8B at Q4 fits comfortably, ~75–85 tok/s estimated
- Qwen 3 30B A3B technically fits at Q4 (~16 GB) but leaves little headroom once the OS's own memory use is counted
- Qwen 3 0.6B for extremely fast, simple tasks
- Avoid: 32B dense (doesn't fit)
24–32 GB RAM (M4 Pro base, M2 Pro)
- Best pick: Qwen 3 30B A3B (Q4, ~16 GB) — 92+ tok/s
- Qwen 3 8B comfortably at Q8
- Qwen 3 32B at Q4 (~20 GB) won't fit on 24 GB and is marginal on 32 GB once OS overhead is counted
- Run 30B A3B at Q4 for the best quality-per-speed ratio
36–64 GB RAM (M4 Max, M3 Max)
- Best pick for quality: Qwen 3 32B (Q4_K_M, ~20 GB) — 22 tok/s with dense-model quality
- Best pick for speed: Qwen 3 30B A3B (Q8, ~30 GB) — 52+ tok/s
- Can run 30B A3B Q6 at 77 tok/s with higher quality than Q4
- 64 GB comfortably handles Qwen 3 32B at Q8 (~34 GB)
128 GB RAM (M4 Max 128 GB, M1/M2/M3 Ultra)
- Best pick: Qwen 3 32B at Q8 — full quality, ~34 GB, plenty of headroom
- Qwen 3 235B A22B at Q4 (~140 GB nominal) is a stretch even at 128 GB — the reference run managed 8.1 tok/s on M4 Max 128 GB
- Run 30B A3B at any quantization with full context window
- Multiple models loaded simultaneously possible
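The tier recommendations above condense into a small lookup. This is a sketch of this guide's own picks — the thresholds and model choices are editorial, not an official mapping:

```python
def pick_qwen3(ram_gb: int) -> str:
    """Map a Mac's unified memory to this guide's 'best pick' Qwen 3 model."""
    if ram_gb >= 128:
        return "Qwen 3 32B @ Q8"        # full quality, plenty of headroom
    if ram_gb >= 36:
        return "Qwen 3 32B @ Q4_K_M"    # best dense quality, 22 tok/s
    if ram_gb >= 24:
        return "Qwen 3 30B A3B @ Q4"    # 92+ tok/s MoE sweet spot
    return "Qwen 3 4B @ Q4"             # fast and capable on 16 GB

print(pick_qwen3(32))  # Qwen 3 30B A3B @ Q4
```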
Qwen 3 4B quantization ladder — M4 Max (40-core GPU, 64 GB)
The 4B variant is extremely fast on M4 Max and stays useful even at Q8 (111 tok/s). Great for real-time interactive use.
| Quantization | RAM required | Avg tok/s |
|---|---|---|
| Q4_G32 | ~2.5 GB | 149.1 tok/s |
| Q4 | ~2.4 GB | 148.1 tok/s |
| Q5 | ~2.9 GB | 143.2 tok/s |
| Q5_G32 | ~3.0 GB | 143.0 tok/s |
| Q6 | ~3.4 GB | 136.6 tok/s |
| Q8 | ~4.5 GB | 111.5 tok/s |
Source: MLX reference runs on M4 Max 40-core GPU, 64 GB. See Ollama vs MLX for runtime differences.
Qwen 3 vs Llama 3 — which family?
Choose Qwen 3 if:
- You need 30B+ quality at 8B speeds — 30B A3B MoE is unmatched
- You need a 128K context window on mid-size models (Qwen 3 8B and up; the 0.6B–4B variants are 32K)
- You want the latest model family with strong multilingual support
- You have 24+ GB RAM and want the best quality-per-tok tradeoff
- Coding and reasoning are primary tasks (Qwen 3 trained specifically for this)
Choose Llama 3 if:
- You have only 8 GB and need a capable model (Llama 3.1 8B at ~5 GB)
- Ecosystem compatibility is important — Llama is more widely supported
- You want the simplest setup with the widest runtime support
- You need the best 70B model (Llama 3.3 70B is top-tier dense)
- Running on Ollama with default settings — Llama variants are better tested
Qwen 3 235B A22B — the 128 GB+ MoE frontier
Qwen 3 235B A22B activates 22B parameters per token from a 235B total pool. At Q4_K_M, it requires ~140 GB — only 128 GB+ Macs can run it (and it will be tight).
Performance on M4 Max 128 GB
- 8.1 tok/s at Q4_K_M — slow but functional
- This is frontier-tier on local hardware
- Context window: 10K tokens measured (128K max spec)
- Similar quality tier to GPT-4-class on benchmarks
Which hardware to run it on
- M4 Max 128 GB: 8.1 tok/s (confirmed, reference run)
- M2 Ultra 192 GB: likely fits better; faster throughput expected
- M3 Ultra 256 GB: most comfortable, higher throughput
- At ~8 tok/s, use for batch/background processing, not live chat
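For batch planning at ~8 tok/s, it helps to translate throughput into wall-clock time. Pure arithmetic on the reference number, nothing measured here:

```python
def gen_time_s(tokens: int, tok_per_s: float = 8.1) -> float:
    """Seconds to generate `tokens` output tokens at a steady decode rate."""
    return tokens / tok_per_s

# A 1,000-token output takes about two minutes on the M4 Max 128 GB run
print(round(gen_time_s(1000)))  # 123 seconds
```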
Data
benchmarks.json — full dataset · chips.json — chip summaries · benchmarks.csv — CSV export