llama-3-1-8b-instruct — ranking first, raw rows below
Start with the ranked Mac table above. Use the rest of this page to inspect raw Apple Silicon coverage and model metadata.
Quantizations observed: Q4_K - Medium
Quick take
The fastest published result is 63.3 tok/s on the M3 Ultra (80-core GPU, 256 GB) at Q4_K - Medium. Start with the Rankings for the decision, then use the raw rows below to audit the evidence.
Based on 52 external benchmarks; no lab runs yet.
Published runtimes: llamafile.
Current published coverage
Published chip coverage includes the M3 Ultra (80-core GPU, 256 GB), M3 Ultra (80-core GPU, 512 GB), M5 Max (32-core GPU, 36 GB), M2 Ultra (60-core GPU, 64 GB), and M4 Max (40-core GPU, 48 GB), plus 47 more chip tiers. The fastest published row is 63.3 tok/s on the M3 Ultra (80-core GPU, 256 GB) at Q4_K - Medium.
Raw benchmark rows for llama-3-1-8b-instruct
Rows stay below the ranking because this page is answer-first. Use them to inspect exact chips, quantizations, runtimes, and sources.
| Chip | Quant | Avg tok/s | Runtime | Source |
|---|---|---|---|---|
| M3 Ultra (80-core GPU, 256 GB) | Q4_K - Medium | 63.3 tok/s | llamafile | ref |
| M3 Ultra (80-core GPU, 512 GB) | Q4_K - Medium | 62.7 tok/s | llamafile | ref |
| M5 Max (32-core GPU, 36 GB) | Q4_K - Medium | 61.6 tok/s | llamafile | ref |
| M2 Ultra (60-core GPU, 64 GB) | Q4_K - Medium | 59.5 tok/s | llamafile | ref |
| M4 Max (40-core GPU, 48 GB) | Q4_K - Medium | 55.1 tok/s | llamafile | ref |
| M1 Ultra (64-core GPU, 128 GB) | Q4_K - Medium | 54.3 tok/s | llamafile | ref |
| M4 Max (40-core GPU, 128 GB) | Q4_K - Medium | 51.6 tok/s | llamafile | ref |
| M1 Ultra (48-core GPU, 128 GB) | Q4_K - Medium | 48.9 tok/s | llamafile | ref |
| M4 Max (32-core GPU, 36 GB) | Q4_K - Medium | 48.1 tok/s | llamafile | ref |
| M4 Max (40-core GPU, 64 GB) | Q4_K - Medium | 47.1 tok/s | llamafile | ref |
| M2 Max (38-core GPU, 96 GB) | Q4_K - Medium | 46.4 tok/s | llamafile | ref |
| M3 Max (40-core GPU, 128 GB) | Q4_K - Medium | 45.8 tok/s | llamafile | ref |
| M2 Max (38-core GPU, 32 GB) | Q4_K - Medium | 44.7 tok/s | llamafile | ref |
| M1 Max (32-core GPU, 64 GB) | Q4_K - Medium | 37.8 tok/s | llamafile | ref |
| M3 Max (30-core GPU, 96 GB) | Q4_K - Medium | 37.7 tok/s | llamafile | ref |
| M3 Max (30-core GPU, 36 GB) | Q4_K - Medium | 37.5 tok/s | llamafile | ref |
| M1 Max (32-core GPU, 32 GB) | Q4_K - Medium | 35.4 tok/s | llamafile | ref |
| M4 Pro (20-core GPU, 64 GB) | Q4_K - Medium | 32.9 tok/s | llamafile | ref |
| M4 Pro (20-core GPU, 48 GB) | Q4_K - Medium | 32.7 tok/s | llamafile | ref |
| M4 Pro (20-core GPU, 24 GB) | Q4_K - Medium | 32.5 tok/s | llamafile | ref |
| M1 Max (24-core GPU, 64 GB) | Q4_K - Medium | 32.1 tok/s | llamafile | ref |
| M2 Max (30-core GPU, 32 GB) | Q4_K - Medium | 31.2 tok/s | llamafile | ref |
| M4 Pro (16-core GPU, 24 GB) | Q4_K - Medium | 30.5 tok/s | llamafile | ref |
| M4 Pro (16-core GPU, 48 GB) | Q4_K - Medium | 30.2 tok/s | llamafile | ref |
| M2 Pro (19-core GPU, 32 GB) | Q4_K - Medium | 26.3 tok/s | llamafile | ref |
| M3 Max (40-core GPU, 64 GB) | Q4_K - Medium | 25.4 tok/s | llamafile | ref |
| M2 Pro (16-core GPU, 16 GB) | Q4_K - Medium | 24.3 tok/s | llamafile | ref |
| M2 Pro (16-core GPU, 32 GB) | Q4_K - Medium | 23.8 tok/s | llamafile | ref |
| M5 (10-core GPU, 32 GB) | Q4_K - Medium | 22.3 tok/s | llamafile | ref |
| M3 Pro (18-core GPU, 36 GB) | Q4_K - Medium | 22.1 tok/s | llamafile | ref |
| M1 Pro (16-core GPU, 16 GB) | Q4_K - Medium | 21.9 tok/s | llamafile | ref |
| M1 Pro (16-core GPU, 32 GB) | Q4_K - Medium | 21.7 tok/s | llamafile | ref |
| M3 Pro (14-core GPU, 36 GB) | Q4_K - Medium | 21.5 tok/s | llamafile | ref |
| M3 Pro (18-core GPU, 18 GB) | Q4_K - Medium | 20.8 tok/s | llamafile | ref |
| M1 Pro (14-core GPU, 16 GB) | Q4_K - Medium | 20.1 tok/s | llamafile | ref |
| M1 Pro (14-core GPU, 32 GB) | Q4_K - Medium | 20.0 tok/s | llamafile | ref |
| M3 Pro (14-core GPU, 18 GB) | Q4_K - Medium | 19.1 tok/s | llamafile | ref |
| M2 (8-core GPU, 8 GB) | Q4_K - Medium | 18.3 tok/s | llamafile | ref |
| M4 (10-core GPU, 32 GB) | Q4_K - Medium | 16.8 tok/s | llamafile | ref |
| M4 (10-core GPU, 16 GB) | Q4_K - Medium | 16.0 tok/s | llamafile | ref |
| M4 (10-core GPU, 24 GB) | Q4_K - Medium | 15.9 tok/s | llamafile | ref |
| M4 (8-core GPU, 16 GB) | Q4_K - Medium | 15.3 tok/s | llamafile | ref |
| M1 Ultra (GPU count not published, 128 GB) | Q4_K - Medium | 15.2 tok/s | llamafile | ref |
| M2 (10-core GPU, 16 GB) | Q4_K - Medium | 14.7 tok/s | llamafile | ref |
| M2 (10-core GPU, 24 GB) | Q4_K - Medium | 14.7 tok/s | llamafile | ref |
| M1 (8-core GPU, 8 GB) | Q4_K - Medium | 14.6 tok/s | llamafile | ref |
| M3 (10-core GPU, 16 GB) | Q4_K - Medium | 13.5 tok/s | llamafile | ref |
| M1 (7-core GPU, 8 GB) | Q4_K - Medium | 13.4 tok/s | llamafile | ref |
| M2 (8-core GPU, 16 GB) | Q4_K - Medium | 12.9 tok/s | llamafile | ref |
| M3 (GPU count not published, 16 GB) | Q4_K - Medium | 11.8 tok/s | llamafile | ref |
| M3 (10-core GPU, 24 GB) | Q4_K - Medium | 10.2 tok/s | llamafile | ref |
| M1 (7-core GPU, 16 GB) | Q4_K - Medium | 9.4 tok/s | llamafile | ref |
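One common way to audit rows like these is to filter by a memory budget and rank by throughput. A minimal sketch, using a handful of rows transcribed from the table above (the `fastest_within_budget` helper is illustrative, not part of any published tooling):

```python
# A few published rows from the table above, at Q4_K - Medium:
# (chip, unified memory in GB, average tok/s)
rows = [
    ("M3 Ultra (80-core GPU)", 256, 63.3),
    ("M5 Max (32-core GPU)", 36, 61.6),
    ("M4 Max (40-core GPU)", 48, 55.1),
    ("M4 Pro (20-core GPU)", 24, 32.5),
    ("M4 (10-core GPU)", 16, 16.0),
]

def fastest_within_budget(rows, max_gb):
    """Return published rows that fit the memory budget, fastest first."""
    fits = [r for r in rows if r[1] <= max_gb]
    return sorted(fits, key=lambda r: r[2], reverse=True)

for chip, mem_gb, tok_s in fastest_within_budget(rows, 48):
    print(f"{chip:24s} {mem_gb:>3d} GB  {tok_s:5.1f} tok/s")
```

With a 48 GB budget this surfaces the M5 Max row first, since its 36 GB configuration outruns the 48 GB M4 Max in the published data.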
Data
benchmarks.json — full dataset · models.json — model summaries · benchmarks.csv — CSV export
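To audit the CSV export programmatically, the rows can be parsed with Python's standard `csv` module. This is a sketch under assumptions: the column names (`chip`, `quant`, `avg_tok_s`, `runtime`) and the inline sample are hypothetical — check the header row of the real benchmarks.csv before relying on them.

```python
import csv
import io

# Hypothetical sample mirroring two rows from the table; chip names contain
# commas, so the real export would need them quoted exactly as shown here.
sample = '''chip,quant,avg_tok_s,runtime
"M3 Ultra (80-core GPU, 256 GB)",Q4_K - Medium,63.3,llamafile
"M4 Max (40-core GPU, 48 GB)",Q4_K - Medium,55.1,llamafile
'''

# In practice, replace io.StringIO(sample) with open("benchmarks.csv").
records = list(csv.DictReader(io.StringIO(sample)))
fastest = max(records, key=lambda r: float(r["avg_tok_s"]))
print(fastest["chip"], fastest["avg_tok_s"])
```

`csv.DictReader` keys each row by the header, so renamed or reordered columns in a future export only require updating the field names used here.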