
Multilingual AI Benchmarking: Why Even the Best Models Struggle Outside English

Abstract dark-blue digital artwork featuring glowing blue and violet light threads flowing across the center, intersecting around subtle characters from multiple writing systems — including Arabic, Latin, Chinese, and Korean — symbolizing multilingual data and AI connectivity.

Based on research from the University of Maryland, Microsoft, and UMass Amherst (COLM 2025)


The illusion of universal intelligence


We often talk about AI as if it “understands everything.”

In reality, most large language models (LLMs) — even the latest generation — still think in English.


A new benchmark called ONERULER challenges that illusion. Developed by researchers from the University of Maryland, Microsoft, and UMass Amherst, it evaluates how well AI models handle long documents in 26 languages.


The results are eye-opening:


English isn’t even in the top 5. Polish ranks first, with ~88% accuracy. And the performance gap between high- and low-resource languages keeps growing the longer the document gets.

What Multilingual AI Benchmarking Reveals


ONERULER is a multilingual adaptation of the RULER benchmark, originally designed for English-only testing.

It measures how well models can retrieve and reason across extremely long contexts — think of reading an entire book or legal contract and finding one specific detail (“needle-in-a-haystack” tasks).


The benchmark covers 26 languages, from English, Spanish, and German to Hindi, Swahili, Tamil, and Sesotho. It simulates real-world situations where context and precision matter — summarizing policies, auditing compliance reports, or answering questions buried in long documents.
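To make the task concrete, here is a minimal sketch of the general needle-in-a-haystack recipe — not ONERULER’s actual implementation. The filler sentences, prompts, and the call_llm helper are illustrative placeholders for whatever model client you use.

```python
import random

# Minimal needle-in-a-haystack probe (illustrative sketch, not ONERULER's actual
# code): hide one "needle" sentence in a long haystack of repeated filler text,
# then ask the model to retrieve it. `call_llm` is a placeholder for any
# chat-completion client.

FILLER = {
    "en": "The grass is green. The sky is blue. The sun is warm.",
    "pl": "Trawa jest zielona. Niebo jest niebieskie. Słońce grzeje.",
}
NEEDLE = {
    "en": "The special magic number is {n}.",
    "pl": "Specjalna magiczna liczba to {n}.",
}
QUESTION = {
    "en": "What is the special magic number mentioned above? Answer with the number only.",
    "pl": "Jaka jest specjalna magiczna liczba wspomniana powyżej? Odpowiedz samą liczbą.",
}


def build_prompt(lang: str, n_filler: int = 2000, seed: int = 0) -> tuple[str, str]:
    """Return (prompt, expected_answer) for one haystack in the given language."""
    rng = random.Random(seed)
    expected = str(rng.randint(100_000, 999_999))
    chunks = [FILLER[lang]] * n_filler
    chunks.insert(rng.randrange(n_filler), NEEDLE[lang].format(n=expected))
    return "\n".join(chunks) + "\n\n" + QUESTION[lang], expected


def retrieval_accuracy(lang: str, call_llm, trials: int = 20) -> float:
    """Fraction of trials in which the model's answer contains the hidden number."""
    hits = sum(
        expected in call_llm(prompt)
        for prompt, expected in (build_prompt(lang, seed=s) for s in range(trials))
    )
    return hits / trials
```

Scaling n_filler is how you push the context toward the 8K, 64K, or 128K-token lengths the benchmark reports on.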


The key findings


The study tested several leading open- and closed-weight models, including OpenAI’s o3-mini-high, Google’s Gemini 1.5 Flash, Llama 3.3, and Qwen 2.5.

The headline results tell a clear story:

  • Polish was the top-performing language on long-context tasks (64K–128K tokens), with ~88% accuracy.

  • English ranked 6th (~83.9%), while Chinese ranked 4th worst (~62.1%).

  • Low-resource languages (Swahili, Tamil, Hindi, Sesotho) performed significantly worse — and the gap widened from 11% at 8K tokens to 34% at 128K.

  • When given the option to answer “none,” models often falsely claimed no answer existed — even when it did.


These results reveal how language coverage, tokenizer design, and dataset diversity still shape AI reliability today.
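The tokenizer effect is easy to see for yourself: the same sentence can cost very different numbers of tokens depending on the language, so a fixed context window holds much less usable text in some languages than in others. Below is a small sketch using the open-source tiktoken library; the encoding choice and sample sentences are illustrative, and the non-English lines are rough translations.

```python
import tiktoken  # pip install tiktoken

# Count how many tokens the "same" sentence costs in different languages.
# Encoding and sample sentences are illustrative only; exact counts depend on
# the model's tokenizer, and the translations here are rough.
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The special magic number is hidden somewhere in this document.",
    "Polish": "Specjalna magiczna liczba jest ukryta gdzieś w tym dokumencie.",
    "Swahili": "Nambari maalum ya uchawi imefichwa mahali fulani katika hati hii.",
    "Tamil": "சிறப்பு மந்திர எண் இந்த ஆவணத்தில் எங்கோ மறைந்துள்ளது.",
}

for language, sentence in samples.items():
    print(f"{language:8} {len(enc.encode(sentence)):3} tokens")
```

Languages whose scripts tokenize less efficiently effectively get a shorter context window, which is one reason long-context gaps widen for low-resource languages.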


Why Multilingual AI Benchmarking Matters for Business and Security


For global teams and data-driven businesses, this isn’t an academic issue — it’s a risk surface.


When your AI systems process multilingual data (contracts, reports, customer communications), they rely on assumptions built into their training data. If the model’s understanding drops by 30% just because the document is in Swahili or Tamil, that’s not just a translation problem — it’s a trust problem.


In the context of privacy, compliance, or workflow automation, that gap can mean:

  • Misinterpreting local regulatory clauses (e.g. GDPR translations).

  • Failing to detect policy inconsistencies in non-English documents.

  • Incorrectly flagging “no issue” due to the “none” bias observed in the benchmark.


The broader takeaway?

AI doesn’t fail uniformly — it fails selectively, depending on your language, data, and context length.


The “none” problem: when AI is too confident about nothing


One of the study’s most intriguing results comes from a simple experiment.

The researchers modified the prompt to include this line:


“If no such number exists, please answer ‘none.’”

Adding that single sentence dropped model accuracy by 32% at 128K tokens (English).


Many models, including reasoning-optimized ones like o3-mini-high, became overly cautious — insisting that no answer existed even when it clearly did.


For real-world use cases, this reflects a deeper issue: AI systems are often better at guessing something than admitting uncertainty.

In data protection or automation workflows, that can translate into missed alerts, incomplete risk audits, or false compliance confidence.
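The quoted instruction above is the one the researchers added; the harness below is only a rough sketch of how you might probe your own pipeline for the same bias, reusing the placeholder build_prompt and call_llm helpers from the earlier example.

```python
# Probe for "none" bias (illustrative sketch): every prompt here actually
# contains the needle, so any "none" answer is a false negative. Reuses the
# placeholder build_prompt/call_llm helpers from the earlier example.

NONE_INSTRUCTION = "If no such number exists, please answer 'none'."


def none_bias_report(lang: str, call_llm, trials: int = 20) -> dict[str, float]:
    counts = {"baseline_correct": 0, "with_none_correct": 0, "false_none": 0}
    for seed in range(trials):
        prompt, expected = build_prompt(lang, seed=seed)

        baseline = call_llm(prompt)
        counts["baseline_correct"] += expected in baseline

        hedged = call_llm(prompt + "\n" + NONE_INSTRUCTION)
        counts["with_none_correct"] += expected in hedged
        counts["false_none"] += ("none" in hedged.lower()) and (expected not in hedged)

    return {name: value / trials for name, value in counts.items()}
```

A large gap between baseline_correct and with_none_correct, or a non-trivial false_none rate, is the same failure mode the study observed.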


Lessons for multilingual AI adoption


  1. Audit your AI’s language coverage.

Don’t assume your English-tested model performs equally well in other languages. Use multilingual evaluation datasets or local test samples.

  2. Validate automations in the actual language of use.

A workflow automation that works perfectly in English Slack messages might fail silently in Spanish or Arabic threads.

  3. Use cross-lingual redundancy for critical checks.

Running parallel prompts in two languages can expose inconsistencies and prevent silent failure (see the sketch after this list).

  4. Demand transparency from vendors.

Ask what languages were included in fine-tuning and how long-context capabilities are measured.

  5. Stay human-in-the-loop.

Especially for privacy and compliance tasks, combine automated summaries with manual review to ensure context isn’t lost in translation.
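As a concrete illustration of point 3, here is a rough sketch of a cross-lingual redundancy check. The questions, languages, and call_llm client are placeholders, and the escalation rule is just one possible policy.

```python
# Cross-lingual redundancy check (illustrative sketch): ask the same yes/no
# question about a document in two or more languages and escalate to a human
# whenever the answers disagree. `call_llm` is a placeholder model client.

def cross_lingual_check(document: str, question_by_lang: dict[str, str], call_llm) -> dict:
    answers = {}
    for lang, question in question_by_lang.items():
        prompt = f"{document}\n\n{question}\nAnswer strictly 'yes' or 'no'."
        answers[lang] = call_llm(prompt).strip().lower()

    return {
        "answers": answers,
        # Any disagreement is treated as a silent-failure risk and routed to review.
        "needs_human_review": len(set(answers.values())) > 1,
    }


# Example: the same GDPR-style question posed in English and Polish.
# cross_lingual_check(
#     document=contract_text,
#     question_by_lang={
#         "en": "Does this clause allow personal data transfers outside the EU?",
#         "pl": "Czy ta klauzula pozwala na przekazywanie danych osobowych poza UE?",
#     },
#     call_llm=call_llm,
# )
```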


Building trustworthy multilingual AI


At PrivateData, we believe that automation and AI should work around people, not the other way around.

This research is a reminder that fairness, context, and transparency are inseparable from performance.


As AI expands into multilingual, cross-border workflows, benchmarks like ONERULER provide the measurement tools we need — not just to compare models, but to build systems that are truly global, ethical, and reliable.



