How good are local LLMs at translation, and do you actually need the cloud? A reproducible benchmark of 24 on-device, self-hosted, and cloud models translating into English, with the low-resource case (Afrikaans) front and centre. The headline: on Afrikaans→English, a local model running on an 18 GB box lands in a statistical tie with frontier cloud. Every model gets the same blinded Tatoeba sentences, the same prompt, and greedy decoding, and is scored against multiple references with COMET (meaning) and chrF++ (surface). Built to pick a translation model for Lector.

Open and reproducible: the harness, every model's raw outputs, and the seeded test sets are at github.com/heuwels/llm-lang-eval

On Afrikaans the field is tightly bunched: 20 of 24 models fall within ~1.5 COMET (sampling noise) of the top score (95.3), which is a statistical tie, not a ranking. gemma-4-12b-qat (95.0), self-hosted on the 18 GB box, sits in that band alongside frontier cloud, so for Afrikaans→English you don't need the cloud or a big box.
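As a rough check on the "20 of 24" figure, the band can be recomputed from the published Afrikaans COMET scores. This is a minimal sketch that applies a fixed 1.5-point threshold to the rounded scores; the function name and the fixed-threshold rule are assumptions for illustration, and the benchmark's own significance procedure may work differently.

```python
# Afrikaans COMET scores (x100) as published in the results table.
AFRIKAANS_COMET = {
    "gpt-5": 95.3, "gemini-2.5-pro": 95.3, "claude-opus-4.8": 95.1,
    "claude-sonnet-4.6": 95.1, "gemma-4-12b-qat": 95.0, "gpt-4o-mini": 95.0,
    "gpt-4o": 94.9, "mistral-large": 94.8, "llama-3.3-70b": 94.7,
    "gemini-2.5-flash": 94.7, "deepseek-v3.2": 94.7, "gemma-2-27b": 94.5,
    "gemma-4-12b": 94.5, "gemma-3-12b": 94.5, "gemma-3-27b": 94.4,
    "claude-haiku-4.5": 94.3, "ministral-3-14b": 94.3, "gemma-4-e4b": 94.2,
    "qwen3.5-9b": 94.1, "mistral-small-3.2-cloud": 94.0, "gemma-3n-e4b": 93.7,
    "ollama-llama3.1-8b": 93.5, "apfel-foundation": 92.1,
    "qwen2.5-coder-14b": 91.8,
}

NOISE_BAND = 1.5  # approximate sampling-noise width quoted in the text


def leading_band(scores, band=NOISE_BAND):
    """Return the models whose score is within `band` of the best score."""
    top = max(scores.values())
    # Small tolerance so float rounding cannot drop a model sitting on the edge.
    return [model for model, score in scores.items() if top - score <= band + 1e-9]


band = leading_band(AFRIKAANS_COMET)
print(f"{len(band)} of {len(AFRIKAANS_COMET)} models in the leading band")
# → 20 of 24 models in the leading band
```

The cut-off falls between mistral-small-3.2-cloud (94.0) and gemma-3n-e4b (93.7).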

Leaderboard

All 24 models across three languages on one zoomed COMET axis: each row a model (ranked by Afrikaans), one dot per language, the connector its cross-language spread. The zoom makes the tight differences legible; gaps under ~1.5 COMET are sampling noise (see Significance).

[Chart: COMET by model for Afrikaans, German, and Spanish on a zoomed axis from 88 to 96; the dot before each name marks its deployment tier. The plotted values are the COMET columns of the table under "The numbers".]

Deployment tier: on-device (laptop) · self-hosted box (18 GB) · cloud (OpenRouter)

The numbers

COMET (meaning, ×100) over chrF++ (surface), per language. chrF++ rewards character overlap with the reference, so it docks valid paraphrases ("scenery is magnificent" vs "landscape is breathtaking"); COMET scores meaning and credits them. Green = leading band (within ~1.5 COMET, a statistical tie, not a single winner). Rows share the chart's order (ranked by Afrikaans COMET), so near-ties can sit a place apart despite equal rounded scores. n=200 per language.

| Model | Afrikaans COMET | Afrikaans chrF++ | German COMET | German chrF++ | Spanish COMET | Spanish chrF++ |
|---|---|---|---|---|---|---|
| gpt-5 | 95.3 | 83.5 | 93.1 | 75.1 | 93.9 | 78.1 |
| gemini-2.5-pro | 95.3 | 83.0 | 93.5 | 75.8 | 94.4 | 80.3 |
| claude-opus-4.8 | 95.1 | 82.1 | 93.3 | 75.6 | 94.5 | 80.5 |
| claude-sonnet-4.6 | 95.1 | 81.2 | 93.1 | 74.8 | 94.6 | 81.3 |
| gemma-4-12b-qat | 95.0 | 82.8 | 93.1 | 74.5 | 93.6 | 78.4 |
| gpt-4o-mini | 95.0 | 81.3 | 92.9 | 74.0 | 94.3 | 79.3 |
| gpt-4o | 94.9 | 81.0 | 93.4 | 75.0 | 94.2 | 79.3 |
| mistral-large | 94.8 | 81.8 | 93.2 | 75.3 | 94.0 | 78.5 |
| llama-3.3-70b | 94.7 | 82.3 | 92.9 | 74.8 | 93.8 | 78.1 |
| gemini-2.5-flash | 94.7 | 82.5 | 92.4 | 75.0 | 92.4 | 77.4 |
| deepseek-v3.2 | 94.7 | 80.9 | 92.6 | 72.9 | 94.0 | 78.3 |
| gemma-2-27b | 94.5 | 79.7 | 91.2 | 71.4 | 92.2 | 76.6 |
| gemma-4-12b | 94.5 | 81.5 | 93.0 | 74.4 | 93.3 | 77.5 |
| gemma-3-12b | 94.5 | 80.2 | 92.2 | 72.3 | 92.6 | 75.2 |
| gemma-3-27b | 94.4 | 79.6 | 92.6 | 72.4 | 92.9 | 75.7 |
| claude-haiku-4.5 | 94.3 | 80.2 | 92.9 | 74.8 | 94.4 | 79.9 |
| ministral-3-14b | 94.3 | 78.2 | 91.8 | 71.5 | 93.5 | 76.0 |
| gemma-4-e4b | 94.2 | 80.3 | 92.2 | 70.9 | 93.1 | 77.8 |
| qwen3.5-9b | 94.1 | 78.9 | 92.2 | 71.8 | 93.0 | 75.1 |
| mistral-small-3.2-cloud | 94.0 | 79.8 | 92.8 | 73.7 | 93.5 | 77.5 |
| gemma-3n-e4b | 93.7 | 77.0 | 92.3 | 71.1 | 92.9 | 75.0 |
| ollama-llama3.1-8b | 93.5 | 79.3 | 91.6 | 70.4 | 92.6 | 75.7 |
| apfel-foundation | 92.1 | 74.5 | 89.6 | 64.1 | 90.9 | 70.6 |
| qwen2.5-coder-14b | 91.8 | 74.3 | 91.4 | 68.6 | 93.0 | 76.1 |
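To make the surface-versus-meaning point concrete, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch only: it omits the word n-grams that chrF++ adds, it is not the implementation used for the numbers above, and the function names, the `max_n=6` and `beta=2` defaults, and the "best score over the references" rule for multi-reference scoring are assumptions for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


def multi_ref_chrf(hypothesis, references):
    """Score against several references and keep the best match (assumed rule)."""
    return max(simple_chrf(hypothesis, ref) for ref in references)


exact = multi_ref_chrf("Nobody came.", ["No one came.", "Nobody came."])
paraphrase = simple_chrf("The landscape is breathtaking.", "The scenery is magnificent.")
print(f"exact match: {exact:.1f}, valid paraphrase: {paraphrase:.1f}")
```

The exact match scores 100, while the paraphrase scores far lower despite carrying the same meaning. That penalty is what a meaning-based metric like COMET is there to offset.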

Cost, and what you can actually run

Cloud models fill the top of that board, but this is a study of local models, and most of the cloud field can't run on the box at all. The frontier APIs (GPT-5, Claude, Gemini) are closed weights, so self-hosting them was never an option. The open models I could reach through OpenRouter are mostly too big for the hardware: Llama 3.3 is 70B, Mistral Large is larger again, and even the 24B and 27B open models (Mistral Small, Gemma 2 and 3 at 27B) sit at or past the ceiling of an 18 GB Mac once the OS and the KV cache take their share. What genuinely fits is the on-device and self-hosted-box tiers, so here is that field on its own.

[Chart: COMET for the on-device and self-hosted models only, for Afrikaans, German, and Spanish; zoomed axis 88 to 96; dot before each name = tier. Models shown, ranked by Afrikaans COMET: gemma-4-12b-qat, gemma-4-12b, gemma-3-12b, ministral-3-14b, gemma-4-e4b, qwen3.5-9b, gemma-3n-e4b, ollama-llama3.1-8b, apfel-foundation, qwen2.5-coder-14b.]

The strongest model that actually fits, gemma-4-12b-qat at 7.5 GB, is the same one sitting in the frontier band up top. apfel-foundation is Apple's built-in Foundation model, the one that ships with macOS, run through the Apfel harness. It scores respectably on what it answers, but it refused or errored on 26% of the Afrikaans sentences (52 of 200), against 8% in Spanish, and Apple doesn't list Afrikaans among its supported languages, which is why it sits near the bottom.
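The "does it fit" argument can be approximated with back-of-envelope arithmetic: weight size is roughly parameters × bits per weight ÷ 8. The sketch below rests on assumed numbers that do not come from the benchmark: 4-bit quantization, and 5 GB reserved for the OS and KV cache. Real model files run larger than this estimate (the 12B QAT build here is 7.5 GB, not 6 GB), so treat it as a lower bound.

```python
def weight_gb(params_billion, bits_per_weight=4.0):
    """Approximate weight footprint in GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, ram_gb=18.0, reserve_gb=5.0, bits_per_weight=4.0):
    """True if the estimated weights fit in RAM after the assumed reserve."""
    return weight_gb(params_billion, bits_per_weight) <= ram_gb - reserve_gb


# Parameter counts (in billions) for model sizes mentioned in the text.
for size in (12, 24, 27, 70):
    verdict = "fits" if fits(size) else "does not fit"
    print(f"{size}B at 4-bit: ~{weight_gb(size):.1f} GB, {verdict} in 18 GB")
```

Under these assumptions a 12B model fits comfortably, a 24B model sits right at the ceiling, and 27B and 70B models fall past it, which matches the qualitative picture above.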

Cost: use what you've got

Cost barely enters into it. A translation is tiny, roughly 80 tokens, so the entire cloud sweep (24 models across three languages, plus the holdout and the cloze probe) came to $13.62 on OpenRouter, well under a cent per translation even on the frontier models. Per-token pricing still spans an order or two of magnitude, the frontier APIs against the cheap tiers like Gemini Flash or GPT-4o-mini, but at this token count the absolute bill is small whichever way you go.
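The arithmetic behind "well under a cent" is easy to sketch. The 80-token figure comes from the text; the two per-million-token prices are hypothetical placeholders, not quoted OpenRouter rates.

```python
TOKENS_PER_TRANSLATION = 80  # "roughly 80 tokens", from the text


def cost_per_translation(price_per_million_usd, tokens=TOKENS_PER_TRANSLATION):
    """Cost in USD of one translation at a given per-million-token price."""
    return tokens * price_per_million_usd / 1_000_000


# Hypothetical prices (USD per million tokens), for illustration only.
HYPOTHETICAL_PRICES = {"cheap tier": 0.50, "frontier tier": 15.00}

for tier, price in HYPOTHETICAL_PRICES.items():
    each = cost_per_translation(price)
    print(f"{tier}: ${each:.5f} per translation, ${each * 1000:.2f} per thousand")
```

Even with a 30× price gap between the two placeholder tiers, both come out at a small fraction of a cent per lookup.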

What moves the decision is what you already own. A spare Mac is a sunk cost, so a local model is free per lookup beyond the electricity. An existing Claude plan is free at the margin too, within its limits, which is why I reach for the Anthropic OAuth route first. OpenRouter is the only one of the three that adds a real per-token bill, and it earns its place when you need a specific model you can't self-host or don't have a plan for. So the honest answer is usually to use what you've got: a spare box runs a local model, an existing plan already covers the lookups, and with neither, the cheap cloud tier is pennies per thousand. Since a 12B you can run at home already ties the frontier on Afrikaans, paying frontier rates per token buys very little for this particular job.

Contamination check: does it survive on unseen data?

The honest limitation. Tatoeba is almost certainly in every model's pretraining, so a high score can mean "translated well" or "regurgitated a memorised pair". The score alone can't tell us which. To bound it, each model is compared on two matched 150-sentence Afrikaans samples (same length filter): pre-2023 (added 2010 to 2022, almost certainly seen in training) versus 2025-26 (added after the training cutoff of the older-generation models here, so they cannot have memorised them). A large drop on the recent set is the fingerprint of memorisation; a stable score is evidence of genuine translation ability.
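
| Model | pre-2023 COMET | 2025-26 COMET | Δ |
|---|---|---|---|
| gemma-4-12b-qat | 94.6 | 93.4 | -1.2 |
| gemma-3-12b | 94.0 | 93.2 | -0.8 |
| gemma-4-12b | 94.5 | 93.1 | -1.4 |
| gemma-3n-e4b | 93.8 | 92.5 | -1.4 |
| qwen3.5-9b | 93.3 | 92.1 | -1.2 |
| ministral-3-14b | 93.8 | 92.0 | -1.8 |
| gemma-4-e4b | 94.0 | 91.8 | -2.2 |
| qwen2.5-coder-14b | 90.3 | 88.9 | -1.4 |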

Caveat on the caveat: exact training-cutoff dates aren't published for every model, and recently-added sentences may differ subtly in style or difficulty, so read a small Δ as "holds up", not as a precise measurement of contamination.
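A minimal sketch of the comparison, using the rounded scores from the holdout table. Two caveats: recomputing Δ from rounded scores can differ by 0.1 from the published column (presumably computed before rounding), and the 1.5-point flag threshold is an assumption borrowed from the noise band quoted earlier, not a rule stated by the benchmark.

```python
# model: (pre-2023 COMET, 2025-26 COMET), rounded values from the holdout table
HOLDOUT = {
    "gemma-4-12b-qat": (94.6, 93.4),
    "gemma-3-12b": (94.0, 93.2),
    "gemma-4-12b": (94.5, 93.1),
    "gemma-3n-e4b": (93.8, 92.5),
    "qwen3.5-9b": (93.3, 92.1),
    "ministral-3-14b": (93.8, 92.0),
    "gemma-4-e4b": (94.0, 91.8),
    "qwen2.5-coder-14b": (90.3, 88.9),
}

FLAG_THRESHOLD = 1.5  # assumed: drops larger than the quoted noise band


def holdout_deltas(scores):
    """Return recent-minus-old COMET per model, rounded to one decimal."""
    return {model: round(recent - old, 1) for model, (old, recent) in scores.items()}


deltas = holdout_deltas(HOLDOUT)
flagged = sorted(model for model, delta in deltas.items() if delta < -FLAG_THRESHOLD)

for model, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{model:20s} {delta:+.1f}")
print("drop exceeds the assumed noise band:", flagged)
```

Every model drops somewhat on the recent set, but only two exceed the assumed band, and none collapses.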

Parroting probe: memorisation, measured directly

The sharpest contamination test. We blank one informative word per sentence and ask each model to fill it. On unseen (2025-26) sentences it can only predict from context; if it recovers the exact original word much more often on seen (pre-2023) sentences, that gap is the model parroting memorised text rather than reasoning about the language. (It doubles as a cloze-ability score, Lector's own practice task.)
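
| Model | seen recovery % | unseen recovery % | gap |
|---|---|---|---|
| claude-opus-4.8 | 42 | 31 | +11 |
| qwen3.5-9b | 10 | 7 | +4 |
| gemma-3-12b | 17 | 15 | +2 |
| gemma-4-12b-qat | 18 | 17 | +1 |
| gemma-3n-e4b | 6 | 7 | -1 |
| gemma-4-e4b | 6 | 7 | -1 |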

Recovery = exact match of the blanked word. A large positive gap = memorisation; near-zero = genuine context prediction. n≈150 per cell, so gaps within ~±10 are noise.
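One way to sanity-check the "~±10 is noise" rule of thumb is the normal approximation for the difference between two independent proportions. This is an assumed method for illustration, since the source does not say how its noise figure was derived. The rates come from the probe table, with n = 150 per cell.

```python
import math


def gap_margin(seen_pct, unseen_pct, n=150, z=1.96):
    """Approximate 95% margin, in percentage points, on a seen-minus-unseen gap."""
    p1, p2 = seen_pct / 100, unseen_pct / 100
    standard_error = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return 100 * z * standard_error


# model: (seen recovery %, unseen recovery %), from the probe table
PROBE = {
    "claude-opus-4.8": (42, 31),
    "qwen3.5-9b": (10, 7),
    "gemma-3-12b": (17, 15),
    "gemma-4-12b-qat": (18, 17),
    "gemma-3n-e4b": (6, 7),
    "gemma-4-e4b": (6, 7),
}

for model, (seen, unseen) in PROBE.items():
    gap = seen - unseen
    margin = gap_margin(seen, unseen)
    verdict = "beyond noise" if abs(gap) > margin else "within noise"
    print(f"{model:18s} gap {gap:+d}, margin ±{margin:.1f}: {verdict}")
```

Under this approximation the margin depends on the recovery rate: it is close to ±11 at claude-opus-4.8's rates and nearer ±6 for the models that recover few words, so ±10 is a conservative single figure.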

Side-by-side generations

The numbers only say so much. Here are the actual translations where models disagree most. Green marks the highest per-sentence chrF++ for that sentence.

Afrikaans: where the models split

Sentences where models split into clear camps, several agreeing on one wording, several on another (one-off wordings collapsed to a tail). Count × tier-dots per camp; green = within ~6 chrF of the closest-to-reference camp.

Why they differ: almost none of this is error. It's paraphrase choice. A contraction vs the full form, 'by the end' vs 'before the end', one valid synonym over another. Each camp diverges from the single crowd-sourced reference in its own way. The spread is widest on longer, structurally flexible sentences (more ways to order the English) and on the high-resource languages, which is exactly why the COMET (meaning) gaps are far smaller than the chrF (surface) gaps. Read the camps as equally-valid translations, not right-vs-wrong.
afr: Niemand het gekom nie.
ref: No one came. · Nobody came.
  12 models · chrF 100 · "Nobody came."
    gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, llama-3.3-70b, gemini-2.5-flash, gemma-3-12b, gemma-3-27b, claude-haiku-4.5, gemma-4-e4b, gemma-3n-e4b, ollama-llama3.1-8b
  11 models · chrF 100 · "No one came."
    gpt-5, gpt-4o-mini, gpt-4o, mistral-large, deepseek-v3.2, gemma-2-27b, gemma-4-12b, ministral-3-14b, qwen3.5-9b, mistral-small-3.2-cloud, qwen2.5-coder-14b
  +1 one-off wording (apfel-foundation)

afr: Die meisie in die blou jas is my dogter.
ref: The girl in the blue coat is my daughter.
  11 models · chrF 82 · "The girl in the blue jacket is my daughter."
    gpt-4o-mini, llama-3.3-70b, deepseek-v3.2, gemma-2-27b, gemma-3-27b, claude-haiku-4.5, ministral-3-14b, gemma-4-e4b, qwen3.5-9b, mistral-small-3.2-cloud, gemma-3n-e4b
  10 models · chrF 100 · "The girl in the blue coat is my daughter."
    gpt-5, gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, gpt-4o, mistral-large, gemini-2.5-flash, gemma-4-12b, gemma-3-12b
  2 models · chrF 79 · "The girl in the blue dress is my daughter."
    ollama-llama3.1-8b, qwen2.5-coder-14b

afr: Hierdie tomaties het geen smaak nie.
ref: These tomatoes don't have any taste.
  11 models · chrF 57 · "These tomatoes have no taste."
    gpt-5, gemini-2.5-pro, claude-opus-4.8, llama-3.3-70b, gemini-2.5-flash, claude-haiku-4.5, ministral-3-14b, gemma-4-e4b, mistral-small-3.2-cloud, ollama-llama3.1-8b, qwen2.5-coder-14b
  10 models · chrF 44 · "These tomatoes have no flavor."
    gemma-4-12b-qat, gpt-4o-mini, gpt-4o, deepseek-v3.2, gemma-2-27b, gemma-4-12b, gemma-3-12b, gemma-3-27b, qwen3.5-9b, gemma-3n-e4b
  +3 one-off wordings (claude-sonnet-4.6, mistral-large, apfel-foundation)

afr: Ons het 'n groot tuin.
ref: We have a big garden. · We have a big yard.
  14 models · chrF 100 · "We have a big garden."
    gpt-5, gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, llama-3.3-70b, deepseek-v3.2, gemma-4-12b, gemma-3-12b, gemma-4-e4b, mistral-small-3.2-cloud, gemma-3n-e4b, ollama-llama3.1-8b, qwen2.5-coder-14b
  10 models · chrF 63 · "We have a large garden."
    gpt-4o-mini, gpt-4o, mistral-large, gemini-2.5-flash, gemma-2-27b, gemma-3-27b, claude-haiku-4.5, ministral-3-14b, qwen3.5-9b, apfel-foundation

afr: Dit lyk asof ons sukkel met die selfde ou probleem.
ref: We seem to keep grappling with the same old problem.
  10 models · chrF 59 · "It looks like we’re struggling with the same old problem."
    gpt-5, gemini-2.5-pro, claude-sonnet-4.6, gemma-4-12b-qat, gemma-2-27b, gemma-4-12b, claude-haiku-4.5, qwen3.5-9b, gemma-3n-e4b, ollama-llama3.1-8b
  9 models · chrF 61 · "It seems like we're struggling with the same old problem."
    claude-opus-4.8, gpt-4o, mistral-large, llama-3.3-70b, deepseek-v3.2, gemma-3-12b, ministral-3-14b, mistral-small-3.2-cloud, apfel-foundation
  2 models · chrF 61 · "It seems we're struggling with the same old problem."
    gemini-2.5-flash, gemma-3-27b
  +3 one-off wordings (gpt-4o-mini, gemma-4-e4b, qwen2.5-coder-14b)
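The camps in these listings amount to a group-by on the output text. The sketch below shows one way to produce them, assuming exact string matching after trimming whitespace and a minimum camp size of two; the function name and both rules are illustrative assumptions, not the harness's actual logic. The sample uses six of the 24 outputs for the "blou jas" sentence.

```python
from collections import defaultdict


def split_into_camps(outputs, min_camp_size=2):
    """Group models by identical output; smaller groups collapse into a tail."""
    by_text = defaultdict(list)
    for model, text in outputs.items():
        by_text[text.strip()].append(model)

    camps = sorted(
        ((text, models) for text, models in by_text.items()
         if len(models) >= min_camp_size),
        key=lambda camp: len(camp[1]),
        reverse=True,
    )
    tail = [model for models in by_text.values()
            if len(models) < min_camp_size for model in models]
    return camps, tail


# Six of the outputs for "Die meisie in die blou jas is my dogter."
sample = {
    "gpt-5": "The girl in the blue coat is my daughter.",
    "gemini-2.5-pro": "The girl in the blue coat is my daughter.",
    "gpt-4o-mini": "The girl in the blue jacket is my daughter.",
    "llama-3.3-70b": "The girl in the blue jacket is my daughter.",
    "deepseek-v3.2": "The girl in the blue jacket is my daughter.",
    "ollama-llama3.1-8b": "The girl in the blue dress is my daughter.",
}

camps, tail = split_into_camps(sample)
for text, models in camps:
    print(f"{len(models)} models: {text}")
print(f"+{len(tail)} one-off wording(s): {', '.join(tail)}")
```

Exact matching treats a curly and a straight apostrophe as different wordings, so a real pipeline would likely normalize punctuation before grouping.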

German: where the models split

deu: Alle meine Kinder wurden in Boston geboren.
ref: All of my children were born in Boston.
  12 models · chrF 87 · "All my children were born in Boston."
    gpt-5, gemini-2.5-pro, claude-sonnet-4.6, gpt-4o-mini, gpt-4o, gemini-2.5-flash, gemma-2-27b, gemma-3-12b, gemma-4-e4b, mistral-small-3.2-cloud, ollama-llama3.1-8b, apfel-foundation
  12 models · chrF 100 · "All of my children were born in Boston."
    claude-opus-4.8, gemma-4-12b-qat, mistral-large, llama-3.3-70b, deepseek-v3.2, gemma-4-12b, gemma-3-27b, claude-haiku-4.5, ministral-3-14b, qwen3.5-9b, gemma-3n-e4b, qwen2.5-coder-14b

deu: Er studiert an der Technischen Universität.
ref: He studies at the technical university. · He's studying at the technical university.
  12 models · chrF 65 · "He is studying at the Technical University."
    gpt-5, gemini-2.5-pro, gpt-4o-mini, gpt-4o, mistral-large, llama-3.3-70b, gemma-2-27b, gemma-3-12b, qwen3.5-9b, mistral-small-3.2-cloud, ollama-llama3.1-8b, apfel-foundation
  12 models · chrF 73 · "He studies at the Technical University."
    claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, gemini-2.5-flash, deepseek-v3.2, gemma-4-12b, gemma-3-27b, claude-haiku-4.5, ministral-3-14b, gemma-4-e4b, gemma-3n-e4b, qwen2.5-coder-14b

deu: Du darfst das Buch lesen.
ref: You may read this book.
  11 models · chrF 72 · "You may read the book."
    gpt-5, claude-opus-4.8, gemma-4-12b-qat, mistral-large, llama-3.3-70b, gemini-2.5-flash, deepseek-v3.2, gemma-4-12b, gemma-3-27b, ministral-3-14b, gemma-4-e4b
  11 models · chrF 38 · "You are allowed to read the book."
    gemini-2.5-pro, claude-sonnet-4.6, gpt-4o-mini, gpt-4o, gemma-2-27b, gemma-3-12b, claude-haiku-4.5, mistral-small-3.2-cloud, gemma-3n-e4b, apfel-foundation, qwen2.5-coder-14b
  2 models · chrF 37 · "You're allowed to read the book."
    qwen3.5-9b, ollama-llama3.1-8b

deu: Niemand hat Tom gerufen.
ref: Nobody called Tom.
  12 models · chrF 62 · "No one called Tom."
    gpt-5, gemma-4-12b-qat, gpt-4o-mini, gpt-4o, llama-3.3-70b, gemma-2-27b, gemma-4-12b, ministral-3-14b, gemma-4-e4b, qwen3.5-9b, mistral-small-3.2-cloud, qwen2.5-coder-14b
  11 models · chrF 100 · "Nobody called Tom."
    gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, mistral-large, gemini-2.5-flash, gemma-3-12b, gemma-3-27b, claude-haiku-4.5, gemma-3n-e4b, ollama-llama3.1-8b, apfel-foundation
  +1 one-off wording (deepseek-v3.2)

deu: Die Polizisten spielten auf der Polizeistation Schach.
ref: The police officers were playing chess at the police station.
  13 models · chrF 100 · "The police officers were playing chess at the police station."
    gpt-5, gemma-4-12b-qat, gpt-4o, mistral-large, llama-3.3-70b, gemma-4-12b, gemma-3-12b, gemma-3-27b, ministral-3-14b, qwen3.5-9b, mistral-small-3.2-cloud, ollama-llama3.1-8b, qwen2.5-coder-14b
  11 models · chrF 78 · "The police officers played chess at the police station."
    gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gpt-4o-mini, gemini-2.5-flash, deepseek-v3.2, gemma-2-27b, claude-haiku-4.5, gemma-4-e4b, gemma-3n-e4b, apfel-foundation

Spanish: where the models split

spa: Podéis ver la televisión después de cenar.
ref: You can watch television after dinner.
  13 models · chrF 66 · "You can watch TV after dinner."
    gpt-5, gemma-4-12b-qat, gpt-4o, mistral-large, llama-3.3-70b, gemini-2.5-flash, gemma-4-12b, ministral-3-14b, qwen3.5-9b, mistral-small-3.2-cloud, gemma-3n-e4b, ollama-llama3.1-8b, apfel-foundation
  11 models · chrF 100 · "You can watch television after dinner."
    gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gpt-4o-mini, deepseek-v3.2, gemma-2-27b, gemma-3-12b, gemma-3-27b, claude-haiku-4.5, gemma-4-e4b, qwen2.5-coder-14b

spa: Tenía 23 años de edad cuando pinté este cuadro.
ref: When I painted this picture, I was 23 years old.
  13 models · chrF 83 · "I was 23 years old when I painted this picture."
    gemini-2.5-pro, claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, gpt-4o-mini, gpt-4o, mistral-large, llama-3.3-70b, gemini-2.5-flash, gemma-2-27b, gemma-3-12b, gemma-3-27b, gemma-3n-e4b
  10 models · chrF 69 · "I was 23 years old when I painted this painting."
    gpt-5, deepseek-v3.2, gemma-4-12b, claude-haiku-4.5, ministral-3-14b, gemma-4-e4b, qwen3.5-9b, mistral-small-3.2-cloud, ollama-llama3.1-8b, qwen2.5-coder-14b
  +1 one-off wording (apfel-foundation)

spa: La pelota de golf casi entró en el hoyo.
ref: The golf ball almost went in the hole.
  14 models · chrF 90 · "The golf ball almost went into the hole."
    claude-opus-4.8, claude-sonnet-4.6, gemma-4-12b-qat, gpt-4o-mini, gpt-4o, mistral-large, gemini-2.5-flash, deepseek-v3.2, gemma-4-12b, claude-haiku-4.5, ministral-3-14b, mistral-small-3.2-cloud, apfel-foundation, qwen2.5-coder-14b
  10 models · chrF 100 · "The golf ball almost went in the hole."
    gpt-5, gemini-2.5-pro, llama-3.3-70b, gemma-2-27b, gemma-3-12b, gemma-3-27b, gemma-4-e4b, qwen3.5-9b, gemma-3n-e4b, ollama-llama3.1-8b

spa: Estoy en casa de mis padres.
ref: I'm at my parents' house.
  14 models · chrF 100 · "I'm at my parents' house."
    gpt-5, gemini-2.5-pro, claude-opus-4.8, gpt-4o, mistral-large, gemini-2.5-flash, deepseek-v3.2, gemma-3-12b, gemma-3-27b, ministral-3-14b, qwen3.5-9b, mistral-small-3.2-cloud, gemma-3n-e4b, ollama-llama3.1-8b
  10 models · chrF 88 · "I am at my parents' house."
    claude-sonnet-4.6, gemma-4-12b-qat, gpt-4o-mini, llama-3.3-70b, gemma-2-27b, gemma-4-12b, claude-haiku-4.5, gemma-4-e4b, apfel-foundation, qwen2.5-coder-14b

spa: Si es necesario, vendré mañana a las nueve.
ref: If necessary, I'll come at nine tomorrow.
  12 models · chrF 70 · "If necessary, I will come tomorrow at nine."
    claude-sonnet-4.6, gemma-4-12b-qat, gpt-4o-mini, deepseek-v3.2, gemma-2-27b, gemma-4-12b, claude-haiku-4.5, ministral-3-14b, gemma-4-e4b, mistral-small-3.2-cloud, apfel-foundation, qwen2.5-coder-14b
  10 models · chrF 83 · "If necessary, I'll come tomorrow at nine."
    gpt-5, gemini-2.5-pro, claude-opus-4.8, gpt-4o, mistral-large, llama-3.3-70b, gemini-2.5-flash, gemma-3-27b, gemma-3n-e4b, ollama-llama3.1-8b
  +2 one-off wordings (gemma-3-12b, qwen3.5-9b)

Methodology

Caveats