Which local LLM translates best? A reproducible eval

How good are local LLMs at translation, and do you actually need the cloud? A reproducible benchmark of 24 on-device, self-hosted, and cloud models translating into English, with the low-resource case (Afrikaans) front and centre. The headline: on Afrikaans→English a local 18 GB model lands in a statistical tie with frontier cloud. Same blinded Tatoeba sentences, same prompt, greedy decoding, scored multi-reference with COMET (meaning) and chrF++ (surface). Built to pick a translation model for Lector.

Open and reproducible: harness, every model's raw outputs, and the seeded test sets: github.com/heuwels/llm-lang-eval

On Afrikaans the field is tightly bunched: 20 of 24 models fall within ~1.5 COMET (sampling noise) of the top score (≈95), a statistical tie, not a ranking. The self-hosted 18 GB gemma-4-12b-qat (95.0) sits in that band alongside frontier cloud, so for Afrikaans→English, you don't need the cloud or a big box.

Leaderboard

All 24 models across three languages on one zoomed COMET axis: each row a model (ranked by Afrikaans), one dot per language, the connector its cross-language spread. The zoom makes the tight differences legible; gaps under ~1.5 COMET are sampling noise (see Significance).

Afrikaans German Spanish · COMET, zoomed axis 88 to 96; dot before each name = tier

Deployment tier:on-device (laptop) self-hosted box (18 GB) cloud (OpenRouter)

The numbers

COMET (meaning, ×100) over chrF++ (surface), per language. chrF++ rewards character overlap with the reference, so it docks valid paraphrases ("scenery is magnificent" vs "landscape is breathtaking"); COMET scores meaning and credits them. Green = leading band (within ~1.5 COMET, a statistical tie, not a single winner). Rows share the chart's order (ranked by Afrikaans COMET), so near-ties can sit a place apart despite equal rounded scores. n=200 per language.

Model	Afrikaans COMET · chrF++	German COMET · chrF++	Spanish COMET · chrF++
gpt-5	95.3chrF 83.5	93.1chrF 75.1	93.9chrF 78.1
gemini-2.5-pro	95.3chrF 83.0	93.5chrF 75.8	94.4chrF 80.3
claude-opus-4.8	95.1chrF 82.1	93.3chrF 75.6	94.5chrF 80.5
claude-sonnet-4.6	95.1chrF 81.2	93.1chrF 74.8	94.6chrF 81.3
gemma-4-12b-qat	95.0chrF 82.8	93.1chrF 74.5	93.6chrF 78.4
gpt-4o-mini	95.0chrF 81.3	92.9chrF 74.0	94.3chrF 79.3
gpt-4o	94.9chrF 81.0	93.4chrF 75.0	94.2chrF 79.3
mistral-large	94.8chrF 81.8	93.2chrF 75.3	94.0chrF 78.5
llama-3.3-70b	94.7chrF 82.3	92.9chrF 74.8	93.8chrF 78.1
gemini-2.5-flash	94.7chrF 82.5	92.4chrF 75.0	92.4chrF 77.4
deepseek-v3.2	94.7chrF 80.9	92.6chrF 72.9	94.0chrF 78.3
gemma-2-27b	94.5chrF 79.7	91.2chrF 71.4	92.2chrF 76.6
gemma-4-12b	94.5chrF 81.5	93.0chrF 74.4	93.3chrF 77.5
gemma-3-12b	94.5chrF 80.2	92.2chrF 72.3	92.6chrF 75.2
gemma-3-27b	94.4chrF 79.6	92.6chrF 72.4	92.9chrF 75.7
claude-haiku-4.5	94.3chrF 80.2	92.9chrF 74.8	94.4chrF 79.9
ministral-3-14b	94.3chrF 78.2	91.8chrF 71.5	93.5chrF 76.0
gemma-4-e4b	94.2chrF 80.3	92.2chrF 70.9	93.1chrF 77.8
qwen3.5-9b	94.1chrF 78.9	92.2chrF 71.8	93.0chrF 75.1
mistral-small-3.2-cloud	94.0chrF 79.8	92.8chrF 73.7	93.5chrF 77.5
gemma-3n-e4b	93.7chrF 77.0	92.3chrF 71.1	92.9chrF 75.0
ollama-llama3.1-8b	93.5chrF 79.3	91.6chrF 70.4	92.6chrF 75.7
apfel-foundation	92.1chrF 74.5	89.6chrF 64.1	90.9chrF 70.6
qwen2.5-coder-14b	91.8chrF 74.3	91.4chrF 68.6	93.0chrF 76.1

Cost, and what you can actually run

Cloud models fill the top of that board, but this is a study of local models, and most of the cloud field can't run on the box at all. The frontier APIs (GPT-5, Claude, Gemini) are closed weights, so self-hosting them was never an option. The open models I could reach through OpenRouter are mostly too big for the hardware: Llama 3.3 is 70B, Mistral Large is larger again, and even the 24B and 27B open models (Mistral Small, Gemma 2 and 3 at 27B) sit at or past the ceiling of an 18 GB Mac once the OS and the KV cache take their share. What genuinely fits is the on-device and self-hosted-box tiers, so here is that field on its own.

Afrikaans German Spanish · COMET, zoomed axis 88 to 96; dot before each name = tier

The strongest model that actually fits, gemma-4-12b-qat at 7.5 GB, is the same one sitting in the frontier band up top. apfel-foundation is Apple's built-in Foundation model, the one that ships with macOS, run through the Apfel harness. It scores respectably on what it answers, but it refused or errored on 26% of the Afrikaans sentences (52 of 200), against 8% in Spanish, and Apple doesn't list Afrikaans among its supported languages, which is why it sits near the bottom.

Cost: use what you've got

Cost barely enters into it. A translation is tiny, roughly 80 tokens, so the entire cloud sweep (24 models across three languages, plus the holdout and the cloze probe) came to $13.62 on OpenRouter, well under a cent per translation even on the frontier models. Per-token pricing still spans an order or two of magnitude, the frontier APIs against the cheap tiers like Gemini Flash or GPT-4o-mini, but at this token count the absolute bill is small whichever way you go.

What moves the decision is what you already own. A spare Mac is a sunk cost, so a local model is free per lookup beyond the electricity. An existing Claude plan is free at the margin too, within its limits, which is why I reach for the Anthropic OAuth route first. OpenRouter is the only one of the three that adds a real per-token bill, and it earns its place when you need a specific model you can't self-host or don't have a plan for. So the honest answer is usually to use what you've got: a spare box runs a local model, an existing plan already covers the lookups, and with neither, the cheap cloud tier is pennies per thousand. Since a 12B you can run at home already ties the frontier on Afrikaans, paying frontier rates per token buys very little for this particular job.

Contamination check: does it survive on unseen data?

The honest limitation. Tatoeba is almost certainly in every model's pretraining, so a high score can mean "translated well" or "regurgitated a memorised pair". The score alone can't tell us which. To bound it, each model is compared on two matched 150-sentence Afrikaans samples (same length filter): pre-2023 (added 2010 to 2022, almost certainly seen in training) versus 2025-26 (added after the training cutoff of the older-generation models here, so they cannot have memorised them). A large drop on the recent set is the fingerprint of memorisation; a stable score is evidence of genuine translation ability.

Model	pre-2023 COMET	2025-26 COMET	Δ
gemma-4-12b-qat	94.6	93.4	-1.2
gemma-3-12b	94.0	93.2	-0.8
gemma-4-12b	94.5	93.1	-1.4
gemma-3n-e4b	93.8	92.5	-1.4
qwen3.5-9b	93.3	92.1	-1.2
ministral-3-14b	93.8	92.0	-1.8
gemma-4-e4b	94.0	91.8	-2.2
qwen2.5-coder-14b	90.3	88.9	-1.4

Caveat on the caveat: exact training-cutoff dates aren't published for every model, and recently-added sentences may differ subtly in style or difficulty, so read a small Δ as "holds up", not as a precise measurement of contamination.

Parroting probe: memorisation, measured directly

The sharpest contamination test. We blank one informative word per sentence and ask each model to fill it. On unseen (2025-26) sentences it can only predict from context; if it recovers the exact original word much more often on seen (pre-2023) sentences, that gap is the model parroting memorised text rather than reasoning about the language. (It doubles as a cloze-ability score, Lector's own practice task.)

Model	seen recovery %	unseen recovery %	gap
claude-opus-4.8	42	31	+11
qwen3.5-9b	10	7	+4
gemma-3-12b	17	15	+2
gemma-4-12b-qat	18	17	+1
gemma-3n-e4b	6	7	-1
gemma-4-e4b	6	7	-1

Recovery = exact match of the blanked word. A large positive gap = memorisation; near-zero = genuine context prediction. n≈150 per cell, so gaps within ~±10 are noise.

Side-by-side generations

The numbers only say so much. Here are the actual translations where models disagree most. Green marks the highest per-sentence chrF++ for that sentence.

Afrikaans: where the models split

Sentences where models split into clear camps, several agreeing on one wording, several on another (one-off wordings collapsed to a tail). Count × tier-dots per camp; green = within ~6 chrF of the closest-to-reference camp.

Why they differ: almost none of this is error. It's paraphrase choice. A contraction vs the full form, 'by the end' vs 'before the end', one valid synonym over another. Each camp diverges from the single crowd-sourced reference in its own way. The spread is widest on longer, structurally flexible sentences (more ways to order the English) and on the high-resource languages, which is exactly why the COMET (meaning) gaps are far smaller than the chrF (surface) gaps. Read the camps as equally-valid translations, not right-vs-wrong.

afrNiemand het gekom nie.

refNo one came. · Nobody came.

12×100

gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, llama-3.3-70b, gemini-2.5-flash, gemma-3-12b, gemma-3-27b, claude-haiku-4.5, gemma-4-e4b, gemma-3n-e4b, ollama-llama3.1-8b

Nobody came.

11×100

gpt-5, gpt-4o-mini, gpt-4o, mistral-large, deepseek-v3.2, gemma-2-27b, gemma-4-12b, ministral-3-14b, qwen3.5-9b, mistral-small-3.2-cloud, qwen2.5-coder-14b

No one came.

+1 one-off wordings (apfel-foundation)

afrDie meisie in die blou jas is my dogter.

refThe girl in the blue coat is my daughter.

11×82

gpt-4o-mini, llama-3.3-70b, deepseek-v3.2, gemma-2-27b, gemma-3-27b, claude-haiku-4.5, ministral-3-14b, gemma-4-e4b, qwen3.5-9b, mistral-small-3.2-cloud, gemma-3n-e4b

The girl in the blue jacket is my daughter.

10×100

gpt-5, gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, gpt-4o, mistral-large, gemini-2.5-flash, gemma-4-12b, gemma-3-12b

The girl in the blue coat is my daughter.

2×79

ollama-llama3.1-8b, qwen2.5-coder-14b

The girl in the blue dress is my daughter.

afrHierdie tomaties het geen smaak nie.

refThese tomatoes don't have any taste.

11×57

gpt-5, gemini-2.5-pro, claude-opus-4.8, llama-3.3-70b, gemini-2.5-flash, claude-haiku-4.5, ministral-3-14b, gemma-4-e4b, mistral-small-3.2-cloud, ollama-llama3.1-8b, qwen2.5-coder-14b

These tomatoes have no taste.

10×44

gemma-4-12b-qat, gpt-4o-mini, gpt-4o, deepseek-v3.2, gemma-2-27b, gemma-4-12b, gemma-3-12b, gemma-3-27b, qwen3.5-9b, gemma-3n-e4b

These tomatoes have no flavor.

+3 one-off wordings (claude-sonnet-4.6, mistral-large, apfel-foundation)

afrOns het 'n groot tuin.

refWe have a big garden. · We have a big yard.

14×100

gpt-5, gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, llama-3.3-70b, deepseek-v3.2, gemma-4-12b, gemma-3-12b, gemma-4-e4b, mistral-small-3.2-cloud, gemma-3n-e4b, ollama-llama3.1-8b, qwen2.5-coder-14b

We have a big garden.

10×63

gpt-4o-mini, gpt-4o, mistral-large, gemini-2.5-flash, gemma-2-27b, gemma-3-27b, claude-haiku-4.5, ministral-3-14b, qwen3.5-9b, apfel-foundation

We have a large garden.

afrDit lyk asof ons sukkel met die selfde ou probleem.

refWe seem to keep grappling with the same old problem.

10×59

gpt-5, gemini-2.5-pro, claude-sonnet-4.6, gemma-4-12b-qat, gemma-2-27b, gemma-4-12b, claude-haiku-4.5, qwen3.5-9b, gemma-3n-e4b, ollama-llama3.1-8b

It looks like we’re struggling with the same old problem.

9×61

claude-opus-4.8, gpt-4o, mistral-large, llama-3.3-70b, deepseek-v3.2, gemma-3-12b, ministral-3-14b, mistral-small-3.2-cloud, apfel-foundation

It seems like we're struggling with the same old problem.

2×61

gemini-2.5-flash, gemma-3-27b

It seems we're struggling with the same old problem.

+3 one-off wordings (gpt-4o-mini, gemma-4-e4b, qwen2.5-coder-14b)

German: where the models split

deuAlle meine Kinder wurden in Boston geboren.

refAll of my children were born in Boston.

12×87

gpt-5, gemini-2.5-pro, claude-sonnet-4.6, gpt-4o-mini, gpt-4o, gemini-2.5-flash, gemma-2-27b, gemma-3-12b, gemma-4-e4b, mistral-small-3.2-cloud, ollama-llama3.1-8b, apfel-foundation

All my children were born in Boston.

12×100

claude-opus-4.8, gemma-4-12b-qat, mistral-large, llama-3.3-70b, deepseek-v3.2, gemma-4-12b, gemma-3-27b, claude-haiku-4.5, ministral-3-14b, qwen3.5-9b, gemma-3n-e4b, qwen2.5-coder-14b

All of my children were born in Boston.

deuEr studiert an der Technischen Universität.

refHe studies at the technical university. · He's studying at the technical university.

12×65

gpt-5, gemini-2.5-pro, gpt-4o-mini, gpt-4o, mistral-large, llama-3.3-70b, gemma-2-27b, gemma-3-12b, qwen3.5-9b, mistral-small-3.2-cloud, ollama-llama3.1-8b, apfel-foundation

He is studying at the Technical University.

12×73

claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, gemini-2.5-flash, deepseek-v3.2, gemma-4-12b, gemma-3-27b, claude-haiku-4.5, ministral-3-14b, gemma-4-e4b, gemma-3n-e4b, qwen2.5-coder-14b

He studies at the Technical University.

deuDu darfst das Buch lesen.

refYou may read this book.

11×72

gpt-5, claude-opus-4.8, gemma-4-12b-qat, mistral-large, llama-3.3-70b, gemini-2.5-flash, deepseek-v3.2, gemma-4-12b, gemma-3-27b, ministral-3-14b, gemma-4-e4b

You may read the book.

11×38

gemini-2.5-pro, claude-sonnet-4.6, gpt-4o-mini, gpt-4o, gemma-2-27b, gemma-3-12b, claude-haiku-4.5, mistral-small-3.2-cloud, gemma-3n-e4b, apfel-foundation, qwen2.5-coder-14b

You are allowed to read the book.

2×37

qwen3.5-9b, ollama-llama3.1-8b

You're allowed to read the book.

deuNiemand hat Tom gerufen.

refNobody called Tom.

12×62

gpt-5, gemma-4-12b-qat, gpt-4o-mini, gpt-4o, llama-3.3-70b, gemma-2-27b, gemma-4-12b, ministral-3-14b, gemma-4-e4b, qwen3.5-9b, mistral-small-3.2-cloud, qwen2.5-coder-14b

No one called Tom.

11×100

gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, mistral-large, gemini-2.5-flash, gemma-3-12b, gemma-3-27b, claude-haiku-4.5, gemma-3n-e4b, ollama-llama3.1-8b, apfel-foundation

Nobody called Tom.

+1 one-off wordings (deepseek-v3.2)

deuDie Polizisten spielten auf der Polizeistation Schach.

refThe police officers were playing chess at the police station.

13×100

gpt-5, gemma-4-12b-qat, gpt-4o, mistral-large, llama-3.3-70b, gemma-4-12b, gemma-3-12b, gemma-3-27b, ministral-3-14b, qwen3.5-9b, mistral-small-3.2-cloud, ollama-llama3.1-8b, qwen2.5-coder-14b

The police officers were playing chess at the police station.

11×78

gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gpt-4o-mini, gemini-2.5-flash, deepseek-v3.2, gemma-2-27b, claude-haiku-4.5, gemma-4-e4b, gemma-3n-e4b, apfel-foundation

The police officers played chess at the police station.

Spanish: where the models split

spaPodéis ver la televisión después de cenar.

refYou can watch television after dinner.

13×66

gpt-5, gemma-4-12b-qat, gpt-4o, mistral-large, llama-3.3-70b, gemini-2.5-flash, gemma-4-12b, ministral-3-14b, qwen3.5-9b, mistral-small-3.2-cloud, gemma-3n-e4b, ollama-llama3.1-8b, apfel-foundation

You can watch TV after dinner.

11×100

gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gpt-4o-mini, deepseek-v3.2, gemma-2-27b, gemma-3-12b, gemma-3-27b, claude-haiku-4.5, gemma-4-e4b, qwen2.5-coder-14b

You can watch television after dinner.

spaTenía 23 años de edad cuando pinté este cuadro.

refWhen I painted this picture, I was 23 years old.

13×83

gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, gpt-4o-mini, gpt-4o, mistral-large, llama-3.3-70b, gemini-2.5-flash, gemma-2-27b, gemma-3-12b, gemma-3-27b, gemma-3n-e4b

I was 23 years old when I painted this picture.

10×69

gpt-5, deepseek-v3.2, gemma-4-12b, claude-haiku-4.5, ministral-3-14b, gemma-4-e4b, qwen3.5-9b, mistral-small-3.2-cloud, ollama-llama3.1-8b, qwen2.5-coder-14b

I was 23 years old when I painted this painting.

+1 one-off wordings (apfel-foundation)

spaLa pelota de golf casi entró en el hoyo.

refThe golf ball almost went in the hole.

14×90

claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, gpt-4o-mini, gpt-4o, mistral-large, gemini-2.5-flash, deepseek-v3.2, gemma-4-12b, claude-haiku-4.5, ministral-3-14b, mistral-small-3.2-cloud, apfel-foundation, qwen2.5-coder-14b

The golf ball almost went into the hole.

10×100

gpt-5, gemini-2.5-pro, llama-3.3-70b, gemma-2-27b, gemma-3-12b, gemma-3-27b, gemma-4-e4b, qwen3.5-9b, gemma-3n-e4b, ollama-llama3.1-8b

The golf ball almost went in the hole.

spaEstoy en casa de mis padres.

refI'm at my parents' house.

14×100

gpt-5, gemini-2.5-pro, claude-opus-4.8, gpt-4o, mistral-large, gemini-2.5-flash, deepseek-v3.2, gemma-3-12b, gemma-3-27b, ministral-3-14b, qwen3.5-9b, mistral-small-3.2-cloud, gemma-3n-e4b, ollama-llama3.1-8b

I'm at my parents' house.

10×88

claude-sonnet-4.6, gemma-4-12b-qat, gpt-4o-mini, llama-3.3-70b, gemma-2-27b, gemma-4-12b, claude-haiku-4.5, gemma-4-e4b, apfel-foundation, qwen2.5-coder-14b

I am at my parents' house.

spaSi es necesario, vendré mañana a las nueve.

refIf necessary, I'll come at nine tomorrow.

12×70

claude-sonnet-4.6, gemma-4-12b-qat, gpt-4o-mini, deepseek-v3.2, gemma-2-27b, gemma-4-12b, claude-haiku-4.5, ministral-3-14b, gemma-4-e4b, mistral-small-3.2-cloud, apfel-foundation, qwen2.5-coder-14b

If necessary, I will come tomorrow at nine.

10×83

gpt-5, gemini-2.5-pro, claude-opus-4.8, gpt-4o, mistral-large, llama-3.3-70b, gemini-2.5-flash, gemma-3-27b, gemma-3n-e4b, ollama-llama3.1-8b

If necessary, I'll come tomorrow at nine.

+2 one-off wordings (gemma-3-12b, qwen3.5-9b)

Methodology

Task. Blinded source → English. The model sees only the source sentence; references are held out for scoring.
Data. Tatoeba sentence pairs (CC-BY 2.0 FR), seeded random sample, multi-reference where available, length-filtered.
Prompt. One fixed user message, identical across models (no per-model tuning). Chain-of-thought disabled (reasoning_effort: none); translation needs none.
Structured output. Every model is constrained to emit {"translation": "…"} via a json_schema response_format. This is the equaliser: small models otherwise "think out loud" in plain text and bury the answer in preamble. Constrained decoding makes that impossible, gives every model the identical constraint, and mirrors how Lector itself prompts.
Decoding. temperature = 0 (greedy), one model resident at a time on an 18 GB host (JIT load/evict).
Metrics. chrF++ and BLEU via sacreBLEU (signatures recorded). COMET planned.

Caveats

Significance: read bands, not ranks. n=200 per language (Afrikaans largely single-reference), so per-system COMET 95% confidence intervals are roughly 1 to 2 points either way. Differences below ~1.5 COMET are sampling noise: the green leading band is a statistical tie and the sort order within it is not meaningful. Per-segment bootstrap CIs are future work.
This is a proxy. It measures general sentence MT, not Lector's actual word/phrase dictionary-lookup task, a strong signal for model choice, not "Lector's output graded."
Contamination, the big one. Tatoeba is in these models' pretraining, so a high score can reflect memorising the pair rather than reasoning about the language, and the score alone can't separate the two. The contamination-check section above bounds this with a post-cutoff holdout; treat absolute scores with suspicion and weight the pre-vs-post deltas and relative gaps over the headline numbers.
Into-English is the easy direction, and Afrikaans here is largely single-reference. Read accordingly.