About
A departmental research seminar in the KU Leuven Department of Computer Science, hosted by the Human-Computer Interaction group. One hour, one speaker, one focused argument: the way the NLP research community builds multilingual models is quietly broken, and the fix requires both better data discipline and a more honest look at what “multilingual” actually means in practice.
Kushal Tatariya is a PhD researcher at KU Leuven whose work sits at the intersection of multilingual NLP, low-resource languages, and model interpretability. His talk draws on several years of research into the gap between how the field talks about multilingual support and how multilingual models actually behave when tested on languages that are not English.
Hosted by Alek Keersmaekers. Open to students, researchers, and anyone with an interest in NLP. Hybrid format — remote participation available.
Abstract
“Research in multilingual NLP suffers from an English-centric bias, where the trend in the community is to first make things work on English, and then apply them to other languages. This can have adverse effects on the development of NLP tools in various languages as the underlying assumptions made for English may not hold true for many other languages. In my talk, I will speak about my research that looks to mitigate that bias by approaching the problems in multilingual NLP from a holistic perspective: (1) with an acute awareness of the data that is being used to train models; and (2) an explainable understanding of models that are supposed to be ‘multilingual’.”
Speaker
Kushal Jayesh Tatariya
PhD Researcher — Human-Computer Interaction Group, KU Leuven
Kushal Tatariya is a doctoral researcher at KU Leuven working on multilingual natural language processing with a focus on two interconnected problems: the quality of training data used to build multilingual models, and the interpretability of models that claim to work across languages.
His research programme cuts against a comfortable assumption in the field — that scaling multilingual models on more languages automatically produces better multilingual support. His work repeatedly finds that the quality of the data matters at least as much as its quantity. A key paper, “How Good is Your Wikipedia?”, audited Wikipedia across dozens of languages and found systematic quality differences that propagate directly into model behaviour: languages with lower “editing density” (fewer active contributors, less cross-checking) tend to have lower-quality articles, and models trained on this data perform worse on those languages — often invisibly, because benchmarks are themselves concentrated in high-resource languages.
On the interpretability side, his work examines whether the internal representations of multilingual models actually reflect the linguistic structure of each language, or whether they covertly apply English-style assumptions to other languages. The paper “Sociolinguistically Informed Interpretability” (SIGTYP 2024) tackles this through Hinglish — the Hindi-English code-mixed register widely used in South Asian online communication — showing that emotion classification in code-mixed text requires sociolinguistic awareness that current models largely lack.
Selected Papers
- How Good is Your Wikipedia? Auditing Data Quality for Low-Resource and Multilingual NLP. Tatariya, Kulmizev, Poelman, Ploeger et al. — IJCNLP 2025 (arXiv)
- Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification. Tatariya, Lent, Bjerva et al. — SIGTYP @ EACL 2024 (ACL Anthology)
- CreoleVal: Multilingual Multitask Benchmarks for Creoles. Lent, Tatariya, Dabre et al. — Transactions of the ACL (TACL), 2024
- Pixology: Probing the Linguistic and Visual Capabilities of Pixel-Based Language Models. Tatariya, Araujo, Bauwens et al. — EMNLP 2024 (ACL Anthology)
- On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility. Tatariya, Poelman et al. — IJCNLP 2025 (ACL Anthology)
- Transfer Learning for Code-Mixed Data: Do Pretraining Languages Matter? Tatariya, Lent, De Lhoneux — WASSA @ ACL 2023 (ACL Anthology)
What This Talk Is About
The English-Centric Problem in NLP
The modern NLP research pipeline has a structural bias: almost everything is built in English first. Benchmark datasets are in English. Pre-training corpora are dominated by English. Evaluation protocols are designed around English linguistic properties. When researchers say a model is “multilingual”, they usually mean it was exposed to many languages during training — not that it was built to handle them equitably.
This creates a predictable failure pattern. A model trained primarily on English text will implicitly learn English-specific assumptions: how word order signals meaning, how morphology works, how sentences are structured. Applied to languages where these assumptions don’t hold — which is most of them, in different ways — the model degrades gracefully on the benchmark and catastrophically in deployment. The problem is invisible until someone builds a benchmark that actually tests the other languages rigorously.
Data Quality in Multilingual Training
Wikipedia is used as a training source for virtually every major multilingual language model, because it is large, free, structured, and available in hundreds of languages. The implicit assumption is that Wikipedia quality is roughly consistent across languages. Tatariya’s work challenges this directly.
Languages with small Wikipedia editing communities have articles that are shorter, less verified, more likely to be machine-translated from English, and less likely to reflect authentic usage. A model trained on this data does not learn the language — it learns a degraded, English-influenced approximation of it. The model then performs poorly on that language in downstream tasks, and researchers attribute the failure to generic “low-resource challenges” when the actual cause is bad data. Auditing data quality before training is not a minor hygiene step; it is a methodological precondition for honest multilingual evaluation.
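To make the “editing density” idea concrete, here is a toy sketch (not from the talk, and not the paper’s actual methodology): a crude per-language quality proxy computed from editor and article counts. All statistics and the threshold below are invented for illustration.

```python
# Illustrative sketch: a crude "editing density" proxy for Wikipedia quality.
# All numbers and the threshold are hypothetical, for illustration only.

def editing_density(active_editors: int, articles: int) -> float:
    """Active editors per 1,000 articles: a rough proxy for how much
    human review each article receives."""
    if articles == 0:
        return 0.0
    return 1000 * active_editors / articles

# Hypothetical per-language stats: (active editors, article count).
wiki_stats = {
    "en": (40_000, 6_800_000),
    "sw": (60, 80_000),   # Swahili (invented numbers)
    "yo": (30, 33_000),   # Yoruba (invented numbers)
}

# Flag languages whose density falls below an arbitrary threshold,
# i.e. corpora that would deserve a manual audit before training.
THRESHOLD = 2.0
flagged = {
    lang: round(editing_density(editors, arts), 2)
    for lang, (editors, arts) in wiki_stats.items()
    if editing_density(editors, arts) < THRESHOLD
}
print(flagged)  # the low-density languages, with their proxy scores
```

The point of the sketch is only the shape of the argument: a cheap, language-agnostic signal can triage which corpora need human inspection before they are fed to a model.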
Code-Switching — The English Assumption at Its Sharpest
Code-switching is the practice of switching between languages within a single conversation, sentence, or even word — something bilingual and multilingual speakers do naturally and constantly. Hinglish (Hindi + English) is widely used across South Asian online platforms; Spanglish across Latin American and US communities; and dozens of similar contact varieties exist globally. This is not “mixed” language in a degraded sense — it follows sociolinguistic patterns, reflects community identity, and carries meaning that neither language alone conveys.
Standard NLP models, trained on monolingual corpora and evaluated on monolingual benchmarks, handle code-switched text poorly. Tatariya’s work on Hinglish emotion classification shows that the failure is not just a data gap — it is a conceptual one. Explaining model behaviour on code-switched text requires understanding the sociolinguistic context: which language is used for which concepts, how switching signals affect, irony, or formality. Current interpretability tools, designed for monolingual English models, are largely blind to this.
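As a minimal illustration of why code-switched text trips up monolingual pipelines, here is a toy token-level language identifier, the usual first step in code-switching work. The wordlists are invented and far too small to be a real system; real romanised Hindi and English share many surface forms, which is exactly why a lexicon lookup is not enough.

```python
# Toy sketch: token-level language ID for a Hinglish sentence.
# The wordlists are hypothetical; a real system cannot rely on lexicon
# lookup alone, since romanised Hindi and English overlap heavily.

HINDI_ROMANISED = {"bahut", "yaar", "nahi", "accha", "kya"}
ENGLISH = {"this", "movie", "was", "boring", "but", "the", "songs"}

def tag_tokens(sentence: str) -> list[tuple[str, str]]:
    """Tag each token as Hindi (hi), English (en), or unknown (unk)."""
    tags = []
    for token in sentence.lower().split():
        if token in HINDI_ROMANISED:
            tags.append((token, "hi"))
        elif token in ENGLISH:
            tags.append((token, "en"))
        else:
            tags.append((token, "unk"))
    return tags

print(tag_tokens("this movie was bahut boring yaar"))
```

A monolingual English pipeline would treat “bahut” and “yaar” as noise tokens; a code-switching-aware one has to recognise the switch points before it can reason about what the switching itself signals.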
Creole Languages — The Extreme Low-Resource Case
Creole languages are contact languages that developed when speakers of different languages needed to communicate: Haitian Creole (French-based), Tok Pisin (English-based), Papiamentu (Spanish/Portuguese-based), and dozens of others. They are spoken by millions of people, primarily in the Global South. They are almost entirely absent from NLP research.
The CreoleVal benchmark (TACL 2024, co-authored by Tatariya) systematically evaluates multilingual models on Creole languages across multiple tasks — text classification, natural language inference, machine translation. The results are not surprising: models that perform well on other low-resource languages still perform poorly on Creoles. The reasons are illuminating: Creoles are often orthographically non-standardised, underrepresented in pre-training data, and sufficiently distinct from their lexifier languages that transfer learning from those languages provides limited benefit. Building NLP systems for Creole communities requires deliberate inclusion, not an assumption that multilingual models will eventually reach everyone.
Host
KU Leuven — Department of Computer Science
Research seminar hosted by Alek Keersmaekers, within the Human-Computer Interaction group at KU Leuven’s Arenberg campus (Celestijnenlaan). The HCI group works across natural language processing, computational linguistics, and human-centred AI systems. The departmental seminar series brings in speakers on active research in language technology and machine learning.