Basab Jha
Large Language Models (LLMs) such as GPT-4 and mBERT have revolutionized natural language processing (NLP) by providing multilingual capabilities, making it possible to develop models that handle diverse linguistic inputs across many languages. Despite these advances, however, a noticeable performance gap persists between high-resource languages such as English and low-resource languages such as Nepali or Malagasy. We term this phenomenon the "Babel Effect," highlighting the disparity in performance that arises from differences in resource availability across languages. This paper explores the root causes of these discrepancies in LLMs, focusing on the underlying challenges of tokenization, training, and data scarcity. We use cross-lingual benchmarks, such as XGLUE and TyDiQA, to quantify and examine these performance variations in detail. Furthermore, we propose remedies, including improved tokenization strategies, data augmentation techniques, and refined fine-tuning methods. The paper concludes with a discussion of how these improvements can mitigate the Babel Effect and lead to more equitable language modeling across diverse linguistic contexts.
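One concrete symptom of the tokenization challenge mentioned above is subword fertility (subword tokens per whitespace-delimited word), which tends to be higher for under-represented languages. The minimal sketch below is illustrative only, not the paper's experimental setup; it assumes the Hugging Face `transformers` package, the publicly available `bert-base-multilingual-cased` tokenizer, and a hand-picked English/Nepali sentence pair.

```python
# Minimal sketch (assumes Hugging Face `transformers` and the public
# `bert-base-multilingual-cased` tokenizer): compare subword fertility,
# i.e. subword tokens per whitespace-delimited word, for a parallel
# English/Nepali sentence pair. A higher fertility for Nepali is one way
# the tokenization imbalance behind the "Babel Effect" shows up.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Illustrative parallel sentences (not drawn from any benchmark).
sentences = {
    "English": "Nepal is a beautiful country.",
    "Nepali": "नेपाल एक सुन्दर देश हो।",
}

for language, text in sentences.items():
    subwords = tokenizer.tokenize(text)
    fertility = len(subwords) / len(text.split())
    print(f"{language}: {len(subwords)} subword tokens, "
          f"fertility = {fertility:.2f}")
```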