Which local LLM translates best? A reproducible eval
2026-06-19 · source → English · Afrikaans / German / Spanish
How good are local LLMs at translation, and do you actually need the cloud?
A reproducible benchmark of 24 on-device, self-hosted, and cloud models translating into English, with the
low-resource case (Afrikaans) front and centre. The headline: on Afrikaans→English a local 18 GB
model lands in a statistical tie with frontier cloud. Same blinded Tatoeba sentences, same prompt, greedy
decoding, scored multi-reference with COMET (meaning) and chrF++ (surface).
Built to pick a translation model for Lector.
On Afrikaans the field is tightly bunched: 20 of 24 models fall within ~1.5 COMET (sampling noise) of the top score (≈95), a statistical tie, not a ranking. The self-hosted 18 GB gemma-4-12b-qat (95.0) sits in that band alongside frontier cloud, so for Afrikaans→English, you don't need the cloud or a big box.
Leaderboard
All 24 models across three languages on one zoomed COMET axis: each row a model
(ranked by Afrikaans), one dot per language, the connector its cross-language spread. The zoom makes
the tight differences legible; gaps under ~1.5 COMET are sampling noise (see Significance).
AfrikaansGermanSpanish · COMET, zoomed axis 88 to 96; dot before each name = tier
COMET (meaning, ×100) over chrF++ (surface), per language. chrF++
rewards character overlap with the reference, so it docks valid paraphrases
("scenery is magnificent" vs "landscape is breathtaking"); COMET
scores meaning and credits them. Green = leading band (within ~1.5 COMET, a
statistical tie, not a single winner). Rows share the chart's order (ranked by Afrikaans COMET), so
near-ties can sit a place apart despite equal rounded scores. n=200 per language.
Model
Afrikaans COMET · chrF++
German COMET · chrF++
Spanish COMET · chrF++
gpt-5
95.3chrF 83.5
93.1chrF 75.1
93.9chrF 78.1
gemini-2.5-pro
95.3chrF 83.0
93.5chrF 75.8
94.4chrF 80.3
claude-opus-4.8
95.1chrF 82.1
93.3chrF 75.6
94.5chrF 80.5
claude-sonnet-4.6
95.1chrF 81.2
93.1chrF 74.8
94.6chrF 81.3
gemma-4-12b-qat
95.0chrF 82.8
93.1chrF 74.5
93.6chrF 78.4
gpt-4o-mini
95.0chrF 81.3
92.9chrF 74.0
94.3chrF 79.3
gpt-4o
94.9chrF 81.0
93.4chrF 75.0
94.2chrF 79.3
mistral-large
94.8chrF 81.8
93.2chrF 75.3
94.0chrF 78.5
llama-3.3-70b
94.7chrF 82.3
92.9chrF 74.8
93.8chrF 78.1
gemini-2.5-flash
94.7chrF 82.5
92.4chrF 75.0
92.4chrF 77.4
deepseek-v3.2
94.7chrF 80.9
92.6chrF 72.9
94.0chrF 78.3
gemma-2-27b
94.5chrF 79.7
91.2chrF 71.4
92.2chrF 76.6
gemma-4-12b
94.5chrF 81.5
93.0chrF 74.4
93.3chrF 77.5
gemma-3-12b
94.5chrF 80.2
92.2chrF 72.3
92.6chrF 75.2
gemma-3-27b
94.4chrF 79.6
92.6chrF 72.4
92.9chrF 75.7
claude-haiku-4.5
94.3chrF 80.2
92.9chrF 74.8
94.4chrF 79.9
ministral-3-14b
94.3chrF 78.2
91.8chrF 71.5
93.5chrF 76.0
gemma-4-e4b
94.2chrF 80.3
92.2chrF 70.9
93.1chrF 77.8
qwen3.5-9b
94.1chrF 78.9
92.2chrF 71.8
93.0chrF 75.1
mistral-small-3.2-cloud
94.0chrF 79.8
92.8chrF 73.7
93.5chrF 77.5
gemma-3n-e4b
93.7chrF 77.0
92.3chrF 71.1
92.9chrF 75.0
ollama-llama3.1-8b
93.5chrF 79.3
91.6chrF 70.4
92.6chrF 75.7
apfel-foundation
92.1chrF 74.5
89.6chrF 64.1
90.9chrF 70.6
qwen2.5-coder-14b
91.8chrF 74.3
91.4chrF 68.6
93.0chrF 76.1
Cost, and what you can actually run
Cloud models fill the top of that board, but this is a study of local models, and most of
the cloud field can't run on the box at all. The frontier APIs (GPT-5, Claude, Gemini) are closed weights,
so self-hosting them was never an option. The open models I could reach through OpenRouter are mostly too
big for the hardware: Llama 3.3 is 70B, Mistral Large is larger again, and even the 24B and 27B open models
(Mistral Small, Gemma 2 and 3 at 27B) sit at or past the ceiling of an 18 GB Mac once the OS and the KV
cache take their share. What genuinely fits is the on-device and self-hosted-box tiers, so here is that
field on its own.
AfrikaansGermanSpanish · COMET, zoomed axis 88 to 96; dot before each name = tier
The strongest model that actually fits, gemma-4-12b-qat at 7.5 GB, is the
same one sitting in the frontier band up top. apfel-foundation is Apple's built-in Foundation model, the one that ships with macOS, run through the Apfel harness. It scores respectably on what it answers, but it refused or errored on 26% of the Afrikaans sentences (52 of 200), against 8% in Spanish, and Apple doesn't list Afrikaans among its supported languages, which is why it sits near the bottom.
Cost: use what you've got
Cost barely enters into it. A translation is tiny, roughly 80 tokens, so the entire cloud
sweep (24 models across three languages, plus the holdout and the cloze probe) came to $13.62 on OpenRouter,
well under a cent per translation even on the frontier models. Per-token pricing still spans an order or two
of magnitude, the frontier APIs against the cheap tiers like Gemini Flash or GPT-4o-mini, but at this token
count the absolute bill is small whichever way you go.
What moves the decision is what you already own. A spare Mac is a sunk cost, so a local model
is free per lookup beyond the electricity. An existing Claude plan is free at the margin too, within its
limits, which is why I reach for the Anthropic OAuth route first. OpenRouter is the only one of the three
that adds a real per-token bill, and it earns its place when you need a specific model you can't self-host or
don't have a plan for. So the honest answer is usually to use what you've got: a spare box runs a local
model, an existing plan already covers the lookups, and with neither, the cheap cloud tier is pennies per
thousand. Since a 12B you can run at home already ties the frontier on Afrikaans, paying frontier rates per
token buys very little for this particular job.
Contamination check: does it survive on unseen data?
The honest limitation. Tatoeba is almost certainly in every model's
pretraining, so a high score can mean "translated well" or "regurgitated a memorised pair". The
score alone can't tell us which. To bound it, each model is compared on two matched 150-sentence Afrikaans
samples (same length filter): pre-2023 (added 2010 to 2022, almost certainly seen in training)
versus 2025-26 (added after the training cutoff of the older-generation models here, so they
cannot have memorised them). A large drop on the recent set is the fingerprint of memorisation; a stable
score is evidence of genuine translation ability.
Model
pre-2023 COMET
2025-26 COMET
Δ
gemma-4-12b-qat
94.6
93.4
-1.2
gemma-3-12b
94.0
93.2
-0.8
gemma-4-12b
94.5
93.1
-1.4
gemma-3n-e4b
93.8
92.5
-1.4
qwen3.5-9b
93.3
92.1
-1.2
ministral-3-14b
93.8
92.0
-1.8
gemma-4-e4b
94.0
91.8
-2.2
qwen2.5-coder-14b
90.3
88.9
-1.4
Caveat on the caveat: exact training-cutoff dates aren't published for every model, and
recently-added sentences may differ subtly in style or difficulty, so read a small Δ as "holds up", not as a
precise measurement of contamination.
Parroting probe: memorisation, measured directly
The sharpest contamination test. We blank one informative word per
sentence and ask each model to fill it. On unseen (2025-26) sentences it can only predict
from context; if it recovers the exact original word much more often on seen (pre-2023)
sentences, that gap is the model parroting memorised text rather than reasoning about the language. (It
doubles as a cloze-ability score, Lector's own practice task.)
Model
seen recovery %
unseen recovery %
gap
claude-opus-4.8
42
31
+11
qwen3.5-9b
10
7
+4
gemma-3-12b
17
15
+2
gemma-4-12b-qat
18
17
+1
gemma-3n-e4b
6
7
-1
gemma-4-e4b
6
7
-1
Recovery = exact match of the blanked word. A large positive gap = memorisation; near-zero
= genuine context prediction. n≈150 per cell, so gaps within ~±10 are noise.
Side-by-side generations
The numbers only say so much. Here are the actual translations where models
disagree most. Green marks the highest per-sentence chrF++ for that sentence.
Afrikaans: where the models split
Sentences where models split into clear camps, several agreeing on one wording, several on another (one-off wordings collapsed to a tail). Count × tier-dots per camp; green = within ~6 chrF of the closest-to-reference camp.
Why they differ: almost none of this is error. It's paraphrase choice. A contraction vs the full form, 'by the end' vs 'before the end', one valid synonym over another. Each camp diverges from the single crowd-sourced reference in its own way. The spread is widest on longer, structurally flexible sentences (more ways to order the English) and on the high-resource languages, which is exactly why the COMET (meaning) gaps are far smaller than the chrF (surface) gaps. Read the camps as equally-valid translations, not right-vs-wrong.
Sentences where models split into clear camps, several agreeing on one wording, several on another (one-off wordings collapsed to a tail). Count × tier-dots per camp; green = within ~6 chrF of the closest-to-reference camp.
Why they differ: almost none of this is error. It's paraphrase choice. A contraction vs the full form, 'by the end' vs 'before the end', one valid synonym over another. Each camp diverges from the single crowd-sourced reference in its own way. The spread is widest on longer, structurally flexible sentences (more ways to order the English) and on the high-resource languages, which is exactly why the COMET (meaning) gaps are far smaller than the chrF (surface) gaps. Read the camps as equally-valid translations, not right-vs-wrong.
The police officers played chess at the police station.
Spanish: where the models split
Sentences where models split into clear camps, several agreeing on one wording, several on another (one-off wordings collapsed to a tail). Count × tier-dots per camp; green = within ~6 chrF of the closest-to-reference camp.
Why they differ: almost none of this is error. It's paraphrase choice. A contraction vs the full form, 'by the end' vs 'before the end', one valid synonym over another. Each camp diverges from the single crowd-sourced reference in its own way. The spread is widest on longer, structurally flexible sentences (more ways to order the English) and on the high-resource languages, which is exactly why the COMET (meaning) gaps are far smaller than the chrF (surface) gaps. Read the camps as equally-valid translations, not right-vs-wrong.
Task. Blinded source → English. The model sees only the source sentence; references are held out for scoring.
Data.Tatoeba sentence pairs (CC-BY 2.0 FR), seeded random sample, multi-reference where available, length-filtered.
Prompt. One fixed user message, identical across models (no per-model tuning). Chain-of-thought disabled (reasoning_effort: none); translation needs none.
Structured output. Every model is constrained to emit {"translation": "…"} via a json_schema response_format. This is the equaliser: small models otherwise "think out loud" in plain text and bury the answer in preamble. Constrained decoding makes that impossible, gives every model the identical constraint, and mirrors how Lector itself prompts.
Decoding.temperature = 0 (greedy), one model resident at a time on an 18 GB host (JIT load/evict).
Metrics. chrF++ and BLEU via sacreBLEU (signatures recorded). COMET planned.
Caveats
Significance: read bands, not ranks. n=200 per language (Afrikaans largely single-reference), so per-system COMET 95% confidence intervals are roughly 1 to 2 points either way. Differences below ~1.5 COMET are sampling noise: the green leading band is a statistical tie and the sort order within it is not meaningful. Per-segment bootstrap CIs are future work.
This is a proxy. It measures general sentence MT, not Lector's actual word/phrase dictionary-lookup task, a strong signal for model choice, not "Lector's output graded."
Contamination, the big one. Tatoeba is in these models' pretraining, so a high score can reflect memorising the pair rather than reasoning about the language, and the score alone can't separate the two. The contamination-check section above bounds this with a post-cutoff holdout; treat absolute scores with suspicion and weight the pre-vs-post deltas and relative gaps over the headline numbers.
Into-English is the easy direction, and Afrikaans here is largely single-reference. Read accordingly.