llama-3-1-8b-instruct — ranking first, raw rows below
Start with the ranked Mac table above. Use the rest of this page to inspect raw Apple Silicon coverage and model metadata.
Quantizations observed: Q4_K - Medium
Quick take
The fastest published result is 63.3 tok/s on the M3 Ultra (80-core GPU, 256 GB) at Q4_K - Medium. Start with the Rankings for the decision, then use the raw rows below to audit the evidence.
Based on 52 external benchmarks; no lab runs yet.
Published runtimes: llamafile.
Current published coverage
Published chip coverage includes the M3 Ultra (80-core GPU, 256 GB), M3 Ultra (80-core GPU, 512 GB), M5 Max (32-core GPU, 36 GB), M2 Ultra (60-core GPU, 64 GB), and M4 Max (40-core GPU, 48 GB), plus 47 more chip tiers. The fastest published row is 63.3 tok/s on the M3 Ultra (80-core GPU, 256 GB) at Q4_K - Medium.
Raw benchmark rows for llama-3-1-8b-instruct
Rows stay below the ranking because this page is answer-first. Use them to inspect exact chips, quantizations, runtimes, and sources.
| Chip | Quant | Avg tok/s | Runtime | Source |
|---|---|---|---|---|
| M3 Ultra (80-core GPU, 256 GB) | Q4_K - Medium | 63.3 tok/s | llamafile | ref |
| M3 Ultra (80-core GPU, 512 GB) | Q4_K - Medium | 62.7 tok/s | llamafile | ref |
| M5 Max (32-core GPU, 36 GB) | Q4_K - Medium | 61.6 tok/s | llamafile | ref |
| M2 Ultra (60-core GPU, 64 GB) | Q4_K - Medium | 59.5 tok/s | llamafile | ref |
| M4 Max (40-core GPU, 48 GB) | Q4_K - Medium | 55.1 tok/s | llamafile | ref |
| M1 Ultra (64-core GPU, 128 GB) | Q4_K - Medium | 54.3 tok/s | llamafile | ref |
| M4 Max (40-core GPU, 128 GB) | Q4_K - Medium | 51.6 tok/s | llamafile | ref |
| M1 Ultra (48-core GPU, 128 GB) | Q4_K - Medium | 48.9 tok/s | llamafile | ref |
| M4 Max (32-core GPU, 36 GB) | Q4_K - Medium | 48.1 tok/s | llamafile | ref |
| M4 Max (40-core GPU, 64 GB) | Q4_K - Medium | 47.1 tok/s | llamafile | ref |
| M2 Max (38-core GPU, 96 GB) | Q4_K - Medium | 46.4 tok/s | llamafile | ref |
| M3 Max (40-core GPU, 128 GB) | Q4_K - Medium | 45.8 tok/s | llamafile | ref |
| M2 Max (38-core GPU, 32 GB) | Q4_K - Medium | 44.7 tok/s | llamafile | ref |
| M1 Max (32-core GPU, 64 GB) | Q4_K - Medium | 37.8 tok/s | llamafile | ref |
| M3 Max (30-core GPU, 96 GB) | Q4_K - Medium | 37.7 tok/s | llamafile | ref |
| M3 Max (30-core GPU, 36 GB) | Q4_K - Medium | 37.5 tok/s | llamafile | ref |
| M1 Max (32-core GPU, 32 GB) | Q4_K - Medium | 35.4 tok/s | llamafile | ref |
| M4 Pro (20-core GPU, 64 GB) | Q4_K - Medium | 32.9 tok/s | llamafile | ref |
| M4 Pro (20-core GPU, 48 GB) | Q4_K - Medium | 32.7 tok/s | llamafile | ref |
| M4 Pro (20-core GPU, 24 GB) | Q4_K - Medium | 32.5 tok/s | llamafile | ref |
| M1 Max (24-core GPU, 64 GB) | Q4_K - Medium | 32.1 tok/s | llamafile | ref |
| M2 Max (30-core GPU, 32 GB) | Q4_K - Medium | 31.2 tok/s | llamafile | ref |
| M4 Pro (16-core GPU, 24 GB) | Q4_K - Medium | 30.5 tok/s | llamafile | ref |
| M4 Pro (16-core GPU, 48 GB) | Q4_K - Medium | 30.2 tok/s | llamafile | ref |
| M2 Pro (19-core GPU, 32 GB) | Q4_K - Medium | 26.3 tok/s | llamafile | ref |
| M3 Max (40-core GPU, 64 GB) | Q4_K - Medium | 25.4 tok/s | llamafile | ref |
| M2 Pro (16-core GPU, 16 GB) | Q4_K - Medium | 24.3 tok/s | llamafile | ref |
| M2 Pro (16-core GPU, 32 GB) | Q4_K - Medium | 23.8 tok/s | llamafile | ref |
| M5 (10-core GPU, 32 GB) | Q4_K - Medium | 22.3 tok/s | llamafile | ref |
| M3 Pro (18-core GPU, 36 GB) | Q4_K - Medium | 22.1 tok/s | llamafile | ref |
| M1 Pro (16-core GPU, 16 GB) | Q4_K - Medium | 21.9 tok/s | llamafile | ref |
| M1 Pro (16-core GPU, 32 GB) | Q4_K - Medium | 21.7 tok/s | llamafile | ref |
| M3 Pro (14-core GPU, 36 GB) | Q4_K - Medium | 21.5 tok/s | llamafile | ref |
| M3 Pro (18-core GPU, 18 GB) | Q4_K - Medium | 20.8 tok/s | llamafile | ref |
| M1 Pro (14-core GPU, 16 GB) | Q4_K - Medium | 20.1 tok/s | llamafile | ref |
| M1 Pro (14-core GPU, 32 GB) | Q4_K - Medium | 20.0 tok/s | llamafile | ref |
| M3 Pro (14-core GPU, 18 GB) | Q4_K - Medium | 19.1 tok/s | llamafile | ref |
| M2 (8-core GPU, 8 GB) | Q4_K - Medium | 18.3 tok/s | llamafile | ref |
| M4 (10-core GPU, 32 GB) | Q4_K - Medium | 16.8 tok/s | llamafile | ref |
| M4 (10-core GPU, 16 GB) | Q4_K - Medium | 16.0 tok/s | llamafile | ref |
| M4 (10-core GPU, 24 GB) | Q4_K - Medium | 15.9 tok/s | llamafile | ref |
| M4 (8-core GPU, 16 GB) | Q4_K - Medium | 15.3 tok/s | llamafile | ref |
| M1 Ultra (GPU count not published, 128 GB) | Q4_K - Medium | 15.2 tok/s | llamafile | ref |
| M2 (10-core GPU, 16 GB) | Q4_K - Medium | 14.7 tok/s | llamafile | ref |
| M2 (10-core GPU, 24 GB) | Q4_K - Medium | 14.7 tok/s | llamafile | ref |
| M1 (8-core GPU, 8 GB) | Q4_K - Medium | 14.6 tok/s | llamafile | ref |
| M3 (10-core GPU, 16 GB) | Q4_K - Medium | 13.5 tok/s | llamafile | ref |
| M1 (7-core GPU, 8 GB) | Q4_K - Medium | 13.4 tok/s | llamafile | ref |
| M2 (8-core GPU, 16 GB) | Q4_K - Medium | 12.9 tok/s | llamafile | ref |
| M3 (GPU count not published, 16 GB) | Q4_K - Medium | 11.8 tok/s | llamafile | ref |
| M3 (10-core GPU, 24 GB) | Q4_K - Medium | 10.2 tok/s | llamafile | ref |
| M1 (7-core GPU, 16 GB) | Q4_K - Medium | 9.4 tok/s | llamafile | ref |
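One common way to audit rows like these is to filter by a memory budget and rank by throughput. A minimal sketch, using a handful of rows transcribed from the table above (the `fastest_within_budget` helper is illustrative, not part of any published tooling):

```python
# A few published rows from the table above, at Q4_K - Medium:
# (chip, unified memory in GB, average tok/s)
rows = [
    ("M3 Ultra (80-core GPU)", 256, 63.3),
    ("M5 Max (32-core GPU)", 36, 61.6),
    ("M4 Max (40-core GPU)", 48, 55.1),
    ("M4 Pro (20-core GPU)", 24, 32.5),
    ("M4 (10-core GPU)", 16, 16.0),
]

def fastest_within_budget(rows, max_gb):
    """Return published rows that fit the memory budget, fastest first."""
    fits = [r for r in rows if r[1] <= max_gb]
    return sorted(fits, key=lambda r: r[2], reverse=True)

for chip, mem_gb, tok_s in fastest_within_budget(rows, 48):
    print(f"{chip:24s} {mem_gb:>3d} GB  {tok_s:5.1f} tok/s")
```

With a 48 GB budget this surfaces the M5 Max row first, since its 36 GB configuration outruns the 48 GB M4 Max in the published data.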
Data
benchmarks.json — full dataset · models.json — model summaries · benchmarks.csv — CSV export
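To audit the CSV export programmatically, the rows can be parsed with Python's standard `csv` module. This is a sketch under assumptions: the column names (`chip`, `quant`, `avg_tok_s`, `runtime`) and the inline sample are hypothetical — check the header row of the real benchmarks.csv before relying on them.

```python
import csv
import io

# Hypothetical sample mirroring two rows from the table; chip names contain
# commas, so the real export would need them quoted exactly as shown here.
sample = '''chip,quant,avg_tok_s,runtime
"M3 Ultra (80-core GPU, 256 GB)",Q4_K - Medium,63.3,llamafile
"M4 Max (40-core GPU, 48 GB)",Q4_K - Medium,55.1,llamafile
'''

# In practice, replace io.StringIO(sample) with open("benchmarks.csv").
records = list(csv.DictReader(io.StringIO(sample)))
fastest = max(records, key=lambda r: float(r["avg_tok_s"]))
print(fastest["chip"], fastest["avg_tok_s"])
```

`csv.DictReader` keys each row by the header, so renamed or reordered columns in a future export only require updating the field names used here.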