Opinion: Localizing AI through languages is a 2025 imperative
The “Predictions for Global Development” series offers insight from thought leaders for the year ahead. In the field of AI, expect to see large language models developed in languages other than English and Mandarin.
By Uyi Stewart // 16 December 2024

As we look toward 2025, linguistic inclusivity in artificial intelligence development will become increasingly urgent.

As AI transforms life and society around the world, two things are confounding. The first is that the few big technology companies driving AI advancement pay lip service to “AI serving the needs of everyone, everywhere.” The second is that the global development community has embraced an exclusive and unrepresentative AI technology driven predominantly by English and Mandarin, to the detriment of the more than 7,000 languages spoken by about 5 billion people across the communities it seeks to serve.

When a woman from a marginalized community, with little formal education and living in poverty, wakes up to find a lump in her breast, she may not know what her next step should be. She is fluent in her native language but speaks neither English nor Mandarin. How can AI help her navigate the daunting maze from medical diagnosis to treatment? Until AI can “speak” the languages of these vulnerable communities, its potential to advance the Sustainable Development Goals, reduce global disease burdens, and address global inequities will remain limited.

Given this, I believe that in the coming year we can expect growing recognition of the critical gaps in current AI language approaches. Representation matters, and for global development professionals working in the field of AI, I would like to outline four key considerations for localizing AI through languages.

1. There are no shortcuts

We will likely see increased momentum among philanthropies toward a more nuanced approach to making AI work for marginalized and vulnerable communities. Some are proposing to replicate the pharmaceutical vaccine development model by investing in Big Tech to create adaptable large language models, or LLMs. They are therefore funding the big tech companies whose AI systems are driven by English or Mandarin to build modules into their LLM development pipelines that can be adapted to the languages across their geographical footprint.

This is just the start, however. The resulting LLMs will remain insufficient to support the actual implementation of interventions, meaning usable solutions, for the SDGs, because they cannot truly “speak” the languages of billions of people in local communities.

Moreover, looking more closely at the vaccine development model: it benefits marginalized and vulnerable communities because of initiatives like Gavi, the Vaccine Alliance, anchored in public-private partnership. A similar platform is needed to make AI developed by Big Tech work for SDG interventions. By de-risking the investments required for such a platform to localize and adapt these large language models, philanthropies will not only help leverage their technical efforts but will, eventually, help align the incentives of Big Tech, which wants to make AI work for everyone, everywhere, with the aspirations of governments seeking to enable their communities to contribute to and benefit from AI technology.

The saying “a word is enough for the wise” applies here, as more governments are forming AI committees that prioritize content in local languages tailored to their contexts and priorities, such as the Nigerian Multilingual Large Language Model. This will only intensify as interest in AI grows.
2. Reimagining data collection

The coming year will likely highlight the need to scrutinize the quality, representativeness, and completeness of the underlying data on which current LLMs are trained. The acquisition of massive amounts of textual or digitized data, primarily from the internet, leads to biases in AI models. It also creates a new kind of digital divide: a data divide in which languages available online are termed high-resource, while those absent are called low-resource. This invariably exposes a gaping hole in the development of AI models that cater to languages considered low-resource.

As I’ve previously stated in a Devex article, a revised approach is required, one that promotes the digitization of these languages, bringing them online, through data collection playbooks aimed at capturing speech data for these predominantly oral languages. For example, data.org is partnering with Karya Inc, with support from the Mastercard Center for Inclusive Growth, to create a playbook for the digitization of 10 languages in India: Bhojpuri, Konkani, Dogri, Kashmiri, Sindhi, Manipuri, Tripuri, Mizo, Bodo, and Santali, together spoken by over 100 million people. Similarly, in Africa, data.org is partnering with Data Science Nigeria, the University of Lagos, and the University of Pretoria, starting with two Pan-African languages, Yoruba and Hausa, spoken by over 100 million people.

For many of these undigitized oral languages, tones are meaning-bearing, yet most keyboards have no keystrokes for these semantic markers. In many African languages, words that are spelled the same way have different meanings depending on tonal inflection. If these distinctions are not preserved in the corpus used to train LLMs, the resulting models will be inaccurate and insufficient. This underscores the need to collect speech data that captures these critical characteristics of a language, which are lost when relying on internet data and are absent altogether for languages that have never been digitized.

3. Viewing AI as a sociotechnical system

Expect growing discourse on designing AI systems around both social and technical considerations so that they benefit society. When AI systems are developed mainly on text data, they miss critical social features of language, including worldviews, beliefs, culture, and lived experience. Currently, AI model developers use prompt engineering, that is, identifying the variations in the questions that people can ask of the model, combined with text augmentation, human alignment, and similar techniques, to help their models encode the social features of language required for effective communication.

Unfortunately, because of their poor grasp of the idioms, proverbs, symbolism, and nuanced communication that are crucial in many undigitized languages, these AI systems struggle with communicative performance, that is, communication that effectively and appropriately meets the needs of stakeholders in each situation. There needs to be a concerted and intentional effort to develop language corpora for AI systems that go beyond the current practice of text augmentation and human alignment, ensuring that these large language models capture and encode variation across historical, regional, cultural, and sociolinguistic contexts.

Recently, I interviewed a few candidates to take care of a sick relative in my native country, Nigeria.
One of the applicants opened with a special form of greeting that is part of the culture of the Edo Kingdom in southern Nigeria. Greetings are encoded into words based on kinship or lineage. This is not obvious to outsiders, but when this applicant greeted me in my own special word form, I felt a kindred spirit and trust right away. That is communicative performance, and it brings me to my final consideration.

4. Building local capacity to support the development of AI

In the coming year and beyond, we will see an increased focus on developing AI knowledge bases from personal or lived experience. Communities should have the capacity to develop AI solutions that reflect their specific contexts. Initiatives like data.org’s Capacity Accelerator Network can fill gaps in linguistic corpora for LLMs and help build trust with communities. We must all be intentional about democratizing AI development by empowering local communities to capture and contribute local datasets that honor and incorporate Indigenous knowledge, among other things.

Localizing data, language, and skills is crucial to ensuring AI meets the needs of the billions who don’t speak English or Mandarin, while also creating more robust and accurate AI models. Localizing AI is not just a necessity; it is a transformative step toward equity and inclusion for all. A word is enough for the wise.
The views in this opinion piece do not necessarily reflect Devex's editorial views.
Uyi Stewart is the chief data and technology officer at data.org, where he oversees the delivery of programmatic initiatives to accelerate the power of data and AI to solve some of our most pressing global challenges. He holds a doctorate in linguistics and has about 25 years’ experience advancing data for social impact in both the public and private sectors.