Offline Chat

Offline chat means running the full model locally — your conversations never leave your machine. The sweet spot is 7B–14B models for speed, or 32B for noticeably better quality.

Typical model size: 7B–32B
Recommended RAM: 16–64 GB
Key models: Llama 3.1 8B, Llama 3.2 1B
Benchmark rows: 20

Why these models for this use case

Offline chat covers a wide model range. For casual use, a 7B model at Q8 runs at 60–80 tok/s and feels fast. For more thoughtful responses, 14B at Q4 is a good middle ground. If you want GPT-3.5 class quality offline, target a 32B model: at Q4 quantization the weights alone take roughly 20 GB, so you need at least 24 GB of RAM. Ollama and LM Studio both run all of these setups out of the box. A rough way to estimate memory requirements yourself is sketched below.
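A quick sanity check for whether a model fits in RAM is to multiply the parameter count by the bits per weight of the quantization, then add headroom for the KV cache and runtime overhead. The sketch below is a back-of-envelope estimate under assumed values, not something measured from the benchmark data: the bits-per-weight table and the flat 1.5 GB overhead are approximations, and real GGUF files vary by a few percent.

```python
# Back-of-envelope RAM estimate for a quantized model.
# Bits-per-weight values and the overhead constant are rough
# assumptions, not figures taken from the benchmark dataset.

BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,  # K-quants store scale data, so slightly over 4 bits
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def estimate_ram_gb(params_billions: float, quant: str,
                    overhead_gb: float = 1.5) -> float:
    """Weights plus a flat allowance for KV cache and runtime overhead."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb

# A 32B model at Q4_K_M comes out around 20-21 GB, matching the
# ~20 GB figure above and making 24 GB of RAM a realistic floor.
print(f"{estimate_ram_gb(32, 'Q4_K_M'):.1f} GB")
```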

Benchmark results — fastest rows first

Filtered to models commonly used for offline chat. Sorted by avg tok/s descending.

| Chip | Model | Quant | RAM req. | Avg tok/s | Runtime | Source |
|---|---|---|---|---|---|---|
| M4 Max (40-core GPU, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 182.6 | — | ref |
| M4 Max (40-core GPU, 64 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 180.3 | — | ref |
| M4 Max (40-core GPU, 48 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 179.0 | — | ref |
| M3 Ultra (80-core GPU, 512 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 178.8 | — | ref |
| M3 Ultra (80-core GPU, 256 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 177.9 | — | ref |
| M2 Ultra (60-core GPU, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 176.4 | — | ref |
| M2 Ultra (60-core GPU, 64 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 174.1 | — | ref |
| M2 Ultra (60-core GPU, 192 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 169.8 | — | ref |
| M4 Max (32-core GPU, 36 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 166.5 | — | ref |
| M4 Max (GPU count not published, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 156.3 | — | ref |
| M2 Max (38-core GPU, 32 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 153.0 | — | ref |
| M1 Ultra (64-core GPU, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 151.1 | — | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 4B | Q4_G32 | 2.78 GB | 149.1 | MLX | ref |
| M3 Max (40-core GPU, 48 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 149.0 | — | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 4B | Q4 | 2.54 GB | 148.1 | MLX | ref |
| M3 Max (40-core GPU, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 146.3 | — | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 4B | Q5 | 3.26 GB | 143.2 | MLX | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 4B | Q5_G32 | 3.5 GB | 143.0 | MLX | ref |
| M1 Ultra (48-core GPU, 128 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | — | 138.0 | — | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 4B | Q6 | 3.98 GB | 136.6 | MLX | ref |
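To compare your own machine against these rows, you can reproduce a rough avg tok/s figure through Ollama's local HTTP API, which returns eval_count (tokens generated) and eval_duration (nanoseconds spent generating) with each response. This is a minimal single-run sketch, assuming Ollama is serving on its default port and the model is already pulled; the table's figures may average multiple runs and longer generations.

```python
import json
import urllib.request

# Single-run throughput check against a local Ollama server.
# Assumes `ollama serve` is running on the default port 11434 and
# the model has been pulled (e.g. `ollama pull llama3.2:1b`).
payload = json.dumps({
    "model": "llama3.2:1b",
    "prompt": "Explain quantization in two sentences.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# eval_count = generated tokens; eval_duration is in nanoseconds.
tok_s = result["eval_count"] / result["eval_duration"] * 1e9
print(f"{tok_s:.1f} tok/s")
```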


benchmarks.json — full dataset  ·  models.json — model summaries  ·  benchmarks.csv
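For deeper slicing than this page offers, the exported dataset can be filtered directly. The sketch below assumes the CSV export has columns along the lines of chip, model, and avg_tok_s; those names are guesses at the schema, so check the file's header row and adjust.

```python
import csv

# Rebuild the "fastest rows first" view from the raw export.
# Column names (chip, model, avg_tok_s) are assumed, not confirmed;
# inspect the CSV header and rename accordingly.
with open("benchmarks.csv", newline="") as f:
    rows = list(csv.DictReader(f))

chat_models = {"Llama 3.2 1B Instruct", "Qwen 3 4B"}
chat = [r for r in rows if r["model"] in chat_models]
chat.sort(key=lambda r: float(r["avg_tok_s"]), reverse=True)

for r in chat[:5]:
    print(r["chip"], r["model"], r["avg_tok_s"], sep="  |  ")
```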

Buying guide: best Mac for local LLMs →