
Qwen 3 on Apple Silicon — Performance Guide

Qwen 3 is Alibaba's 2025 model family with a trick: the 30B A3B MoE variant runs at 92 tok/s on M4 Max — faster than Llama 3.1 8B. Here's what you need to know to pick the right Qwen 3 model for your Mac.

  • 92 tok/s — Qwen 3 30B A3B (Q4) on M4 Max 64 GB, faster than most 8B models
  • 22 tok/s — Qwen 3 32B dense (Q4_K_M) on M4 Max 64 GB
  • 128K — context window on Qwen 3 32B and 30B A3B
  • 16 GB — minimum RAM for Qwen 3 30B A3B at Q4

Qwen 3 family overview

Qwen 3 comes in two architectures: dense models (standard transformer) and MoE (mixture of experts) models. MoE models activate only a fraction of parameters per token — much faster inference at the same "quality tier".
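The routing idea behind MoE can be sketched in a few lines. This is a toy illustration of top-k expert routing, not Qwen 3's actual router or expert layout:

```python
# Toy sketch of MoE routing (illustrative only — not Qwen 3's real architecture).
# A router scores every expert per token, but only the top-k experts execute,
# so compute per token scales with ACTIVE parameters, not total parameters.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, router_scores, k=2):
    """Run only the top-k experts and mix their outputs by router weight."""
    ranked = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    # Only k experts execute — the remaining weights sit idle in RAM.
    return sum(w * experts[i](token) for w, i in zip(weights, top))

# Eight tiny "experts"; only 2 run per token (cf. 3B active of 30B total).
experts = [lambda x, s=s: x * s for s in range(1, 9)]
out = moe_forward(2.0, experts,
                  router_scores=[0.1, 0.9, 0.3, 0.8, 0.2, 0.1, 0.4, 0.0], k=2)
```

Note that all eight experts still occupy memory — which is why the RAM column below tracks total parameters while the speed column tracks active parameters.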

| Model | Type | Active params | RAM at Q4 | M4 Max 64 GB tok/s | Context |
|---|---|---|---|---|---|
| Qwen 3 0.6B | Dense | 0.6B | ~0.4 GB | — | 32K |
| Qwen 3 4B | Dense | 4B | ~2.5 GB | 148 tok/s (Q4) | 32K |
| Qwen 3 8B | Dense | 8B | ~5 GB | — | 128K |
| Qwen 3 30B A3B | MoE | 3B active / 30B total | ~16 GB | 92 tok/s (Q4) | 128K |
| Qwen 3 32B | Dense | 32B | ~20 GB | 22 tok/s (Q4_K_M) | 128K |
| Qwen 3 235B A22B | MoE | 22B active / 235B total | ~140 GB | — (needs 128 GB+) | 128K |

The MoE advantage: Qwen 3 30B A3B

The 30B A3B model activates only 3B parameters per token despite having 30B total weights. At Q4 quantization, it fits in 16 GB RAM and runs faster than most 8B dense models.

| Model | RAM at Q4 | M4 Max 64 GB tok/s | Speed vs Llama 3.1 8B |
|---|---|---|---|
| Llama 3.1 8B (dense) | ~5 GB | 52.4 tok/s | baseline |
| Qwen 3 30B A3B (MoE) | ~16 GB | 92.1 tok/s | +76% |
| Qwen 3 32B (dense) | ~20 GB | 22.0 tok/s | −58% |

Qwen 3 30B A3B delivers roughly 30B-class quality at 8B-class speeds — or faster. If you're deciding between Llama 3.1 8B (dense) and Qwen 3 30B A3B (MoE), pick 30B A3B if you have 16+ GB free.
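A rough way to see why: decode on Apple Silicon is largely memory-bandwidth bound, so every generated token has to stream each *active* weight through the GPU once. Between two models at the same quantization, predicted throughput scales roughly with the inverse ratio of active parameters (a simplification — routing overhead and attention cost are ignored):

```python
# Back-of-envelope speed model (my simplification, not a benchmark method):
# at equal quantization, tok/s scales roughly with 1 / active parameters,
# since decode streams every active weight per token.
def speed_ratio(active_params_a_b, active_params_b_b):
    """Rough predicted speedup of model A over model B, in billions of active params."""
    return active_params_b_b / active_params_a_b

# Qwen 3 30B A3B (3B active) vs Llama 3.1 8B (8B active):
predicted = speed_ratio(3, 8)   # ≈ 2.7x predicted
measured = 92.1 / 52.4          # ≈ 1.76x from the table above — router and
                                # attention overhead eat into the theoretical gain
```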

Qwen 3 30B A3B quantization ladder — M4 Max (40-core GPU, 64 GB)

| Quantization | RAM required | Avg tok/s | vs Q4 |
|---|---|---|---|
| Q4 | ~16 GB | 92.1 tok/s | baseline |
| Q5 | ~19 GB | 84.9 tok/s | −8% |
| Q6 | ~22 GB | 76.7 tok/s | −17% |
| Q8 | ~30 GB | 52.6 tok/s | −43% |

Q4 is the sweet spot for 30B A3B — stepping up to Q5, Q6, or Q8 costs significant speed for diminishing quality returns on MoE models. Source: MLX reference runs.
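The ladder's RAM column follows a simple linear model. The effective bits-per-weight constants below are my assumptions, fitted to the numbers above (real quants carry per-block scales on top of the nominal bit width):

```python
# Rough RAM model for the quantization ladder. BITS_PER_WEIGHT values are
# assumptions fitted to the table above — effective bits include quantization
# scales and block metadata, so "Q4" is a bit more than 4 bits.
BITS_PER_WEIGHT = {"Q4": 4.3, "Q5": 5.1, "Q6": 5.9, "Q8": 8.0}

def weights_ram_gb(total_params_billions, quant):
    """Weight memory only — KV cache and runtime overhead come on top."""
    return total_params_billions * BITS_PER_WEIGHT[quant] / 8

for q in ("Q4", "Q5", "Q6", "Q8"):
    print(f"30B A3B at {q}: ~{weights_ram_gb(30, q):.0f} GB")
```

Note the inputs are *total* parameters (30B), not active (3B): every expert must stay resident even though only a few run per token.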

Which Qwen 3 model for which Mac?

16 GB RAM (M4, M5 base)

  • Best pick: Qwen 3 4B (Q4/Q8) — 148 tok/s, fast, capable
  • Qwen 3 8B at Q4 fits comfortably, ~75–85 tok/s estimated
  • Qwen 3 30B A3B technically fits at Q4 (~16 GB) but may be tight with OS
  • Qwen 3 0.6B for extremely fast, simple tasks
  • Avoid: 32B dense (doesn't fit)

24–32 GB RAM (M4 Pro base, M2 Pro)

  • Best pick: Qwen 3 30B A3B (Q4, ~16 GB) — 92+ tok/s
  • Qwen 3 8B comfortably at Q8
  • Qwen 3 32B is a poor fit at Q4 — ~20 GB of weights plus OS overhead leaves little headroom even on 32 GB
  • Run 30B A3B at Q4 for the best quality-per-speed ratio

36–64 GB RAM (M4 Max, M3 Max)

  • Best pick for quality: Qwen 3 32B (Q4_K_M, ~20 GB) — 22 tok/s dense quality
  • Best pick for speed: Qwen 3 30B A3B (Q8, ~30 GB) — 52+ tok/s
  • Can run 30B A3B Q6 at 77 tok/s with higher quality than Q4
  • 64 GB comfortably handles Qwen 3 32B at Q8 (~34 GB)

128 GB RAM (M4 Max 128 GB, M1/M2/M3 Ultra)

  • Best pick: Qwen 3 32B at Q8 — full quality, ~32 GB, plenty of headroom
  • Qwen 3 235B A22B at Q4 fits (~140 GB) — 8.1 tok/s on M4 Max 128 GB
  • Run 30B A3B at any quantization with full context window
  • Multiple models loaded simultaneously possible
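The per-tier recommendations above condense into one lookup. A minimal sketch — the thresholds mirror the lists, and the speed-vs-quality choice on 36 GB+ machines is a judgment call:

```python
# Simplified picker for the recommendations above. Thresholds and picks
# mirror the RAM tiers in this guide; treat it as a starting point.
def pick_qwen3(ram_gb, prefer="speed"):
    if ram_gb < 24:
        return "Qwen 3 4B (Q4/Q8)"   # 30B A3B fits on paper at 16 GB but is tight with the OS
    if ram_gb < 36:
        return "Qwen 3 30B A3B (Q4)"
    if ram_gb < 128:
        return "Qwen 3 30B A3B (Q8)" if prefer == "speed" else "Qwen 3 32B (Q4_K_M)"
    return "Qwen 3 235B A22B (Q4)" if prefer == "quality" else "Qwen 3 32B (Q8)"

print(pick_qwen3(24))                    # Qwen 3 30B A3B (Q4)
print(pick_qwen3(64, prefer="quality"))  # Qwen 3 32B (Q4_K_M)
```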

Qwen 3 4B quantization ladder — M4 Max (40-core GPU, 64 GB)

The 4B variant is extremely fast on M4 Max and stays useful even at Q8 (111 tok/s). Great for real-time interactive use.

| Quantization | RAM required | Avg tok/s |
|---|---|---|
| Q4_G32 | ~2.5 GB | 149.1 tok/s |
| Q4 | ~2.4 GB | 148.1 tok/s |
| Q5 | ~2.9 GB | 143.2 tok/s |
| Q5_G32 | ~3.0 GB | 143.0 tok/s |
| Q6 | ~3.4 GB | 136.6 tok/s |
| Q8 | ~4.5 GB | 111.5 tok/s |

Source: MLX reference runs on M4 Max 40-core GPU, 64 GB. See Ollama vs MLX for runtime differences.
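Either runtime can serve these models. A typical session looks like the following — the exact model tags are assumptions, so check `ollama search qwen3` and the mlx-community Hugging Face org for current names:

```shell
# Ollama (GGUF; pulls a Q4-class quant by default).
# Tag name is an assumption — verify with `ollama search qwen3`.
ollama run qwen3:4b "Summarize MoE routing in two sentences."

# MLX via the mlx-lm package (often faster on Apple Silicon).
# Repo name is an assumption — verify on the mlx-community hub.
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3-4B-4bit \
  --prompt "Summarize MoE routing in two sentences."
```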

Qwen 3 vs Llama 3 — which family?

Choose Qwen 3 if:

  • You need 30B+ quality at 8B speeds — 30B A3B MoE is unmatched
  • You need 128K context on mid-size models — Qwen 3 supports it from the 8B model up
  • You want the latest model family with strong multilingual support
  • You have 24+ GB RAM and want the best quality-per-tok tradeoff
  • Coding and reasoning are primary tasks (Qwen 3 trained specifically for this)

Choose Llama 3 if:

  • You have only 8 GB and need a capable model (Llama 3.1 8B at ~5 GB)
  • Ecosystem compatibility is important — Llama is more widely supported
  • You want the simplest setup with the widest runtime support
  • You need the best 70B model (Llama 3.3 70B is top-tier dense)
  • Running on Ollama with default settings — Llama variants are better tested

See Llama vs Qwen detailed comparison →

Qwen 3 235B A22B — the 128 GB+ MoE frontier

Qwen 3 235B A22B activates 22B parameters per token from a 235B total pool. At Q4_K_M, it requires ~140 GB — only 128 GB+ Macs can run it (and it will be tight).

Performance on M4 Max 128 GB

  • 8.1 tok/s at Q4_K_M — slow but functional
  • This is frontier-tier on local hardware
  • Context window: 10K tokens measured (128K max spec)
  • Similar quality tier to GPT-4-class on benchmarks

Which hardware to run it on

  • M4 Max 128 GB: 8.1 tok/s (confirmed, reference run)
  • M2 Ultra 192 GB: likely fits better; faster throughput expected
  • M3 Ultra 256 GB: most comfortable, higher throughput
  • At ~8 tok/s, use for batch/background processing, not live chat
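The ~140 GB figure is just the weight math. Assuming Q4_K_M averages roughly 4.8 effective bits per weight once K-quant block scales are counted (my assumption, not a published spec):

```python
# Weight memory for Qwen 3 235B A22B at Q4_K_M, assuming ~4.8 effective
# bits/weight (assumption). KV cache and runtime overhead come on top,
# which is why a 128 GB Mac is tight even before long contexts.
weights_gb = 235 * 4.8 / 8
print(f"~{weights_gb:.0f} GB of weights alone")
```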

benchmarks.json — full dataset  ·  chips.json — chip summaries  ·  benchmarks.csv — CSV export

See all benchmarks →