
LLMs Flunk Romanized Indic-English: New Benchmark Reveals Code-Mixing Blind Spot

Indi-RomCoM tests 7 instruction-following tasks in 4 Indian languages at 3 mixing intensities; performance drops sharply as code-mixing density rises.


Proprietary, open-weight, and Indic-focused LLMs all choke on Romanized code-mixed instructions—and the denser the mixing, the worse they get. The Indi-RomCoM benchmark, introduced on arXiv (2606.30790), covers 7 instruction-following tasks across Hindi, Bengali, Marathi, and Tamil at 3 controlled code-mixing intensity levels. Zero-shot and few-shot evaluations show a consistent degradation curve: more romanized native words spliced with English means more failures.

Why LLMs Struggle with Romanized Code-Mixing

Romanized code-mixing (RCM) is the dominant everyday communication style for hundreds of millions of bilingual speakers: think "kal movie dekhni hai, bahut excited hoon" (roughly, "I want to watch a movie tomorrow, I'm very excited"). Yet LLMs trained mostly on clean monolingual or native-script text have seen little of this hybrid writing, so they lack the statistical patterns needed to handle it. The benchmark shows that even models fine-tuned on Indic languages underperform when the same language is written in Roman script with English insertions. Reasoning tasks like instruction following degrade less than detection tasks (e.g., toxicity classification), because generating an explanation supplies context that a bare classification label does not.
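To make "mixing density" concrete, here is a minimal sketch that scores a Romanized sentence by the share of its tokens that are English. This is only a crude proxy and is not the paper's definition of its three intensity levels, which the article does not spell out. The `ENGLISH_WORDS` lexicon and the `english_token_ratio` function are hypothetical names invented for illustration; a real pipeline would likely use a word-level language identifier instead of a hand-written word list.

```python
import re

# Hypothetical toy lexicon (an assumption for this sketch, not from the benchmark).
ENGLISH_WORDS = {"movie", "excited"}


def english_token_ratio(sentence, english_words=ENGLISH_WORDS):
    """Return the fraction of tokens in `sentence` found in `english_words`."""
    # Lowercase and keep alphabetic tokens only; punctuation is dropped.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    english = sum(1 for token in tokens if token in english_words)
    return english / len(tokens)


ratio = english_token_ratio("kal movie dekhni hai, bahut excited hoon")
print(f"{ratio:.2f}")  # 2 English tokens out of 7 -> 0.29
```

A ratio like this could be bucketed into low, medium, and high bands to approximate controlled intensity levels, though the thresholds would be a design choice of whoever builds the dataset.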

What the Benchmark Reveals

Indi-RomCoM breaks down performance by task type, language, and code-mixing level. Detection tasks suffer the steepest accuracy drops as mixing intensity increases, with up to 30% relative degradation in some configurations. Reasoning tasks hold up better but still show a clear negative correlation with mixing density. No model class escapes: proprietary APIs, open-weight LLaMA variants, and Indic-specialized models like AI4Bharat’s IndicBERT all exhibit the blind spot. The authors report that few-shot prompting helps marginally but does not close the gap.
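As a rough illustration of how "relative degradation" per slice can be computed, the sketch below groups accuracy scores by task type and language, then compares the lowest and highest mixing levels. The records are made-up placeholder numbers, not results from the paper, and the level names (`low`, `medium`, `high`) and the function `relative_degradation` are assumptions chosen for this example.

```python
from collections import defaultdict

# Placeholder records for illustration only, NOT results from the paper:
# (task_type, language, mixing_level, accuracy)
records = [
    ("detection", "hindi", "low", 0.80),
    ("detection", "hindi", "medium", 0.70),
    ("detection", "hindi", "high", 0.60),
    ("reasoning", "hindi", "low", 0.50),
    ("reasoning", "hindi", "medium", 0.48),
    ("reasoning", "hindi", "high", 0.45),
]


def relative_degradation(records, baseline="low", target="high"):
    """Map (task_type, language) to (acc_baseline - acc_target) / acc_baseline."""
    by_slice = defaultdict(dict)
    for task, lang, level, acc in records:
        by_slice[(task, lang)][level] = acc

    drops = {}
    for key, levels in by_slice.items():
        # Skip slices missing either level, or with a zero baseline.
        if baseline in levels and target in levels and levels[baseline] > 0:
            drops[key] = (levels[baseline] - levels[target]) / levels[baseline]
    return drops


drops = relative_degradation(records)
for (task, lang), drop in sorted(drops.items()):
    print(f"{task:10s} {lang:8s} relative drop: {drop:.1%}")
```

With these placeholder numbers, the detection slice shows a 25% relative drop and the reasoning slice a 10% drop, which mirrors the shape of the pattern described above without reproducing any reported figure.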

If you build products for India’s internet users, this benchmark is a direct measure of where your LLM pipeline leaks. Indi-RomCoM provides the granularity to diagnose failures per language and per task, making it a practical tool for targeted data collection and fine-tuning—not just another academic leaderboard. The next step is figuring out whether synthetic RCM data or script-aware tokenization can bridge the gap without exploding training costs.


Source: Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions
Domain: arxiv.org
