Fine-Tuning LLMs on Arabic Gives No Edge to Related Languages, Task Alignment Explains Gains

A study of 7 LLMs from 4B to 671B parameters finds fine-tuning on Arabic improves zero-shot reading comprehension equally on Semitic and non-Semitic languages, pointing to task-format learning rather than cross-lingual...


Across 7 large language models and two architectures, fine-tuning on Arabic gives related Semitic languages no advantage over unrelated ones. The gains appear to come from learning how to answer reading comprehension questions, not from cross-lingual transfer of linguistic knowledge.

Why Linguistic Relatedness Doesn't Matter

The experiment is clean: fine-tune seven LLMs (from 4B to 671B parameters, covering both dense and Mixture-of-Experts architectures) on Arabic, then test zero-shot on Semitic languages like Hebrew and Amharic plus non-Semitic controls like Turkish and English. If linguistic relatedness mattered, Semitic languages should show bigger improvements. They don't.

Models that start with weak baseline scores improve dramatically across all languages, regardless of family. Models that already score well show only marginal gains, again uniform across languages. The pattern holds for every architecture tested. This is a strong signal that fine-tuning teaches task alignment (how to produce the answer format) rather than transferring knowledge about Arabic grammar or vocabulary to cognate languages.
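The comparison described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the study's evaluation code: the function names are hypothetical, and the accuracy values are made-up placeholders chosen so that every language gains the same amount. If relatedness mattered, the gap between the mean Semitic gain and the mean control gain would be clearly positive.

```python
from statistics import mean

# Made-up placeholder accuracies, NOT results from the study.
baseline = {"Hebrew": 0.40, "Amharic": 0.30, "Turkish": 0.42, "English": 0.50}
finetuned = {"Hebrew": 0.60, "Amharic": 0.50, "Turkish": 0.62, "English": 0.70}

# Languages related to the fine-tuning language (Arabic).
SEMITIC = {"Hebrew", "Amharic"}


def per_language_gain(before, after):
    """Return the accuracy change for each language after fine-tuning."""
    return {lang: after[lang] - before[lang] for lang in before}


def family_gap(before, after, related):
    """Mean gain on related languages minus mean gain on control languages."""
    gains = per_language_gain(before, after)
    related_gain = mean(g for lang, g in gains.items() if lang in related)
    control_gain = mean(g for lang, g in gains.items() if lang not in related)
    return related_gain - control_gain


gap = family_gap(baseline, finetuned, SEMITIC)
print(f"Semitic minus control gain: {gap:+.3f}")
```

A gap near zero, as in this placeholder data, is what the uniform-gain pattern looks like numerically. A real analysis would also need a significance test across models and runs, which this sketch omits.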

What the Ablation Reveals

Chain-of-thought reasoning without any fine-tuning produces the same pattern. The models that benefit most from fine-tuning also benefit most from inference-time chain-of-thought, and the sizes of the two improvements are correlated. Both mechanisms address the same bottleneck: understanding the reading comprehension task format. Neither transfers language-specific knowledge.
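One way to check whether two sets of gains track each other is a Pearson correlation across models. The sketch below assumes one gain value per model for each mechanism. The `pearson` helper is written out by hand for clarity, and the numbers are made-up placeholders, not figures from the paper, which this summary does not report.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Made-up placeholder gains per model, NOT results from the study.
finetune_gain = [0.25, 0.18, 0.10, 0.04, 0.02]
cot_gain = [0.22, 0.15, 0.09, 0.05, 0.01]

r = pearson(finetune_gain, cot_gain)
print(f"Correlation between fine-tuning and chain-of-thought gains: {r:.2f}")
```

A coefficient close to 1 would be consistent with both interventions addressing the same task-format bottleneck. With only seven models, any such estimate would carry wide uncertainty.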

This result challenges a core assumption in multilingual NLP. If you thought fine-tuning on a high-resource language like Arabic would bootstrap understanding of low-resource Semitic languages, your money is on the wrong mechanism. The models learn to better parse questions and locate answer spans, not to map Arabic lexicons onto Hebrew or Amharic.

Future work on cross-lingual transfer should focus on explicit knowledge injection or alignment across language families, because fine-tuning alone isn't doing what we thought it was.

Source: Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer
Domain: arxiv.org
