Offline Chat
Offline chat means running the full model locally — your conversations never leave your machine. The sweet spot is 7B–14B models for speed, or 32B for noticeably better quality.
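To make "never leaves your machine" concrete, here is a minimal sketch of a single chat turn against a local Ollama server, assuming the default port 11434 and that `llama3.1:8b` has already been pulled. The only address involved is localhost.

```python
# A minimal local chat turn against Ollama's /api/chat endpoint.
# Assumes `ollama serve` is running on the default port and
# `ollama pull llama3.1:8b` has already been done.
import requests

def chat(prompt: str, model: str = "llama3.1:8b") -> str:
    # stream=False makes Ollama return one JSON object instead of chunks.
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

print(chat("Explain quantization in one sentence."))
```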
7B–32B — typical model size
16–64 GB — recommended RAM
Llama 3.1 8B, Llama 3.2 1B — key models
20 — benchmark rows
Why these models for this use case
Offline chat spans a wide model range. For casual use, a 7B model at Q8 runs at 60–80 tok/s and feels fast. For more thoughtful responses, 14B at Q4 is a good middle ground. If you want GPT-3.5-class quality offline, 32B models are the target: a Q4 quant is roughly 20 GB on disk, so you need at least 24 GB of RAM. Ollama and LM Studio both run all of these sizes out of the box.
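The RAM figures above follow from a simple estimate: weights take roughly params × bits-per-weight / 8 bytes. A quick sketch, where the effective bits-per-weight values are rough assumptions for llama.cpp-style quants and real GGUF files vary:

```python
# Back-of-the-envelope memory check for the sizes quoted above.
# Effective bits-per-weight are rough assumptions (Q4_K_M ~= 4.8,
# Q8_0 ~= 8.5); real files differ by a gigabyte or so, and you need
# headroom for the KV cache and the OS on top of the weights.
BITS_PER_WEIGHT = {"Q4": 4.8, "Q8": 8.5}

def est_weights_gb(params_billions: float, quant: str) -> float:
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for params, quant in [(7, "Q8"), (14, "Q4"), (32, "Q4")]:
    print(f"{params}B at {quant}: ~{est_weights_gb(params, quant):.1f} GB of weights")
# 7B at Q8 ~7.4 GB, 14B at Q4 ~8.4 GB, 32B at Q4 ~19.2 GB,
# which is where the "~20 GB, at least 24 GB RAM" guidance comes from.
```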
Benchmark results — fastest rows first
Filtered to models commonly used for offline chat. Sorted by avg tok/s descending.
Recommended chips for this use case
Other use cases
Data
benchmarks.json — full dataset · models.json — model summaries · benchmarks.csv — full dataset as CSV
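If you want to re-derive the table's ordering from the raw files, a sketch like the following works. The column names (`model`, `quant`, `avg_tok_s`) are assumptions; check the header row of benchmarks.csv for the real ones.

```python
# Reproduce the "fastest rows first" sort from the raw CSV export.
# Column names here are assumed, not confirmed by the dataset.
import csv

with open("benchmarks.csv", newline="") as f:
    rows = list(csv.DictReader(f))

rows.sort(key=lambda r: float(r["avg_tok_s"]), reverse=True)  # fastest first
for r in rows[:5]:
    print(r["model"], r["quant"], r["avg_tok_s"])
```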