
Coding Assistant

Coding assistants need fast token generation and low latency. A 7B–14B model at Q4–Q8 gives you near-instant responses. Here is the benchmark data to guide your hardware choice.

Typical model size: 7B–14B
Recommended RAM: 16–24 GB
Key models: Llama 3.1 8B, Llama 3.2 1B
Benchmark rows: 20

Why these models for this use case

Coding assistants benefit from speed over size. A 7B model at Q8 on M4 runs at 60–80 tok/s — fast enough that responses appear nearly instant. The 14B tier gives meaningfully better code quality with only modest speed sacrifice. Models in the Qwen 2.5 and Llama 3 families are popular for coding because they were trained on large code corpora. 32B+ models are overkill for autocomplete but useful for complex refactoring tasks.
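As a rough sanity check on the RAM recommendations above: weight memory is approximately parameters × bits per weight ÷ 8, plus headroom for the KV cache and runtime buffers. A minimal sketch — the 20% overhead factor is an assumption, and effective bits per weight varies by quant format:

```python
def est_ram_gb(n_params_billion: float, bits_per_weight: float,
               overhead: float = 1.2) -> float:
    """Rough memory estimate: weight bytes plus ~20% for KV cache and buffers.

    The overhead factor is an assumption; real usage depends on context
    length, runtime, and batch size.
    """
    weight_gb = n_params_billion * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

# Llama 3.1 8B at Q4_K (~4.5 effective bits/weight)
print(est_ram_gb(8, 4.5))   # ≈ 5.4 GB
```

This is why an 8B model at Q4 fits comfortably in 16 GB of unified memory, while Q8 roughly doubles the weight footprint.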

Benchmark results — fastest rows first

Filtered to models commonly used as coding assistants. Sorted by avg tok/s, descending.

| Chip | Model | Quant | RAM req. | Avg tok/s | Runtime | Source |
|---|---|---|---|---|---|---|
| M4 Max (40-core GPU, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 182.6 | — | ref |
| M4 Max (40-core GPU, 64 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 180.3 | — | ref |
| M4 Max (40-core GPU, 48 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 179.0 | — | ref |
| M3 Ultra (80-core GPU, 512 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 178.8 | — | ref |
| M3 Ultra (80-core GPU, 256 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 177.9 | — | ref |
| M2 Ultra (60-core GPU, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 176.4 | — | ref |
| M2 Ultra (60-core GPU, 64 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 174.1 | — | ref |
| M2 Ultra (60-core GPU, 192 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 169.8 | — | ref |
| M4 Max (32-core GPU, 36 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 166.5 | — | ref |
| M4 Max (GPU count not published, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 156.3 | — | ref |
| M2 Max (38-core GPU, 32 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 153.0 | — | ref |
| M1 Ultra (64-core GPU, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 151.1 | — | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 4B | Q4_G32 | 2.78 GB | 149.1 | MLX | ref |
| M3 Max (40-core GPU, 48 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 149.0 | — | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 4B | Q4 | 2.54 GB | 148.1 | MLX | ref |
| M3 Max (40-core GPU, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 146.3 | — | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 4B | Q5 | 3.26 GB | 143.2 | MLX | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 4B | Q5_G32 | 3.5 GB | 143.0 | MLX | ref |
| M1 Ultra (48-core GPU, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 138.0 | — | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 4B | Q6 | 3.98 GB | 136.6 | MLX | ref |
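The MLX rows list on-disk model size, which lets you back out the effective bits per weight — quant formats store scales and zero-points alongside the weights, so a "Q4" file lands above 4 bits. A minimal sketch, assuming the Qwen 3 4B parameter count is exactly 4 billion (the real count is slightly different, so treat the result as approximate):

```python
def effective_bits_per_weight(file_size_gb: float, n_params_billion: float) -> float:
    """Effective bit width implied by file size, including quantization
    metadata overhead (scales, zero-points, unquantized embeddings)."""
    return round(file_size_gb * 8 / n_params_billion, 2)

# Qwen 3 4B at Q4 from the table: 2.54 GB on disk
print(effective_bits_per_weight(2.54, 4))   # ~5.08 bits/weight
```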



benchmarks.json — full dataset  ·  models.json — model summaries  ·  benchmarks.csv
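If you want to slice the full dataset yourself, the sort used in the table above takes only a few lines. A sketch over rows transcribed from this page — the field names here are illustrative, not necessarily the actual schema of benchmarks.json:

```python
# A few rows transcribed from the table above; in practice you would load
# benchmarks.json and map its real field names onto this shape.
rows = [
    {"chip": "M4 Max (40-core GPU, 128 GB)",  "model": "Llama 3.2 1B Instruct", "tok_s": 182.6},
    {"chip": "M4 Max (40-core GPU, 64 GB)",   "model": "Qwen 3 4B",             "tok_s": 149.1},
    {"chip": "M1 Ultra (48-core GPU, 128 GB)", "model": "Llama 3.2 1B Instruct", "tok_s": 138.0},
]

# Sort fastest-first, matching the table's ordering.
by_speed = sorted(rows, key=lambda r: r["tok_s"], reverse=True)
print(by_speed[0]["chip"])   # M4 Max (40-core GPU, 128 GB)
```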

Buying guide: best Mac for local LLMs →