Want inclusive AI? Teach it to speak more languages
Uyi Stewart of data.org unveils a new global coalition that will focus on digitizing languages and building local capacity to ensure AI benefits the global majority, and that LLMs become more accurate and robust in the process.
By Catherine Cheney // 27 September 2024

The rapid development of artificial intelligence is leaving low- and middle-income countries behind, due in large part to the lack of digitization of the vast majority of the world’s languages. While there are over 7,000 languages spoken globally, AI models, and particularly large language models, or LLMs, are predominantly trained on just two languages — English and Mandarin. This disparity creates a new kind of digital divide, which is not about infrastructure or the internet, but rather a data divide.

“This technology is data hungry, and the developers of this technology therefore are voracious and looking for where to get the most available data. Well, the most available data is on the internet,” said Uyi Stewart, chief data and technology officer at data.org, a nonprofit organization committed to using data and AI to address society’s most pressing challenges. “Languages that are available on the internet are now called high-resource languages, and languages that are not available on the internet are called low-resource languages.”

Right now, he said, because low-resource languages are not available on the internet, AI models cannot access them.

Speaking at Devex’s event on the sidelines of the 79th session of the U.N. General Assembly, he announced the launch of a global coalition aimed at making AI more equitable. Set to launch Friday with support from the Mastercard Center for Inclusive Growth, the coalition will focus on digitizing languages and building local capacity to ensure AI benefits the global majority. Stewart highlighted this initiative as a critical step in reimagining data collection methods, especially in regions with primarily oral languages, to make AI more inclusive.

He said that collecting data in more languages could also help tackle problems that currently exist, such as when AI generates responses that are irrelevant, invented, or inaccurate.

“LLMs and foundational models will continue to hallucinate,” Stewart said. “Why? Because developers of this technology have skipped an important step that requires local contribution.”

By incorporating spoken languages in data collection, developers of AI models can also begin to capture the emotional and cultural context that is often missing in text-based models, Stewart said.

In a range of low- and middle-income country contexts, such as Nigeria, where Stewart is from, data.org has been partnering with universities, innovation hubs, and governments to build local ecosystems for AI development. The goal is to train local citizens to contribute to digitization efforts, Stewart said. Localizing data, language, and skills is critical to ensure that AI serves the global majority, and it will also lead to more robust and accurate AI models.

Stewart closed his talk by referring to the printing press and the ways it democratized knowledge. “That’s the model we're trying to replicate here, and that’s what we’re calling for this global coalition around digitization, so that we can get to the next wave of our evolution,” he said.
Catherine Cheney is the Senior Editor for Special Coverage at Devex. She leads the editorial vision of Devex’s news events and editorial coverage of key moments on the global development calendar. Catherine joined Devex as a reporter, focusing on the role of technology and innovation in making progress on the Sustainable Development Goals. Prior to joining Devex, Catherine earned her bachelor’s and master’s degrees from Yale University, and worked as a web producer for POLITICO, a reporter for World Politics Review, and special projects editor at NationSwell. She has reported domestically and internationally for outlets including The Atlantic and The Washington Post. Catherine also works for the Solutions Journalism Network, a nonprofit organization that supports journalists and news organizations in reporting on responses to problems.