The use of Large Language Models (LLMs) to automate natural language processing (NLP) tasks has flourished in English-speaking regions, driven by increased computing power and a surplus of digitized English-language data. The same cannot be said for the majority of the more than 7,000 languages spoken globally, producing stark performance gaps in LLMs between English and non-English languages. AI safety evaluations and mitigations are also frequently limited or absent for non-English languages. Expanding LLM capabilities in language generation, knowledge utilization, and complex reasoning to non-English languages, through increased digitization of low-resource language data and innovations in model architectures, can make LLMs, and generative AI systems more broadly, more applicable and accessible to communities where low-resource languages are widely spoken, including those in the Global South.
On September 19, 2024, fellows participated in a discussion on the topic of “AI: Moving beyond English.” To seed the conversation, three fellows presented concrete case studies and research findings on existing efforts to bridge the divide in linguistic representation within LLMs, along with best practices for improving the performance of AI systems in non-English languages.
Discussion Summary
The central focus of the second discussion session was to identify approaches that make LLMs more accessible in non-English languages, particularly for communities in the Global South. Presenters were quick to note that, due to increased globalization, urbanization, and migration, people around the world have been systematically incentivized to learn to speak and write English in pursuit of better economic opportunities. English has also become the de facto language of the internet, and much of the data used to train LLMs and other multimodal generative AI systems is in English. Common Crawl statistics, for example, consistently place English at roughly 43 to 46% of the crawled text used for training, with most other high-resource languages falling between 5 and 10%. This is not to say that models that perform best in English are without limitations; they often struggle with efficiency, reasoning, bias, safety, and security. These limitations, however, are even more pronounced in non-English models.
To better understand the resource disparity behind the rise of English dominance in AI, presenters referenced recent studies by the International Telecommunication Union (via the World Bank) and GSMA Intelligence mapping internet usage and penetration globally and by region. Per the International Telecommunication Union study, countries in the Global North and Oceania have the highest percentage of the population accessing the internet, while countries across Africa have the lowest. The GSMA Intelligence study compared, by region, the share of the population connected to the internet with the share unable to connect. Europe ranks highest, with 85% of the continent’s population connected and only 1% lacking the means to be online, leaving a usage gap of 14%. By comparison, in Sub-Saharan Africa only 23% of the population is connected and 17% is unable to connect, leaving a usage gap of 60%.
English is consistently the most popular language on the internet despite ranking only third globally by number of native speakers. A November 2023 study from AI4D shows that English accounts for as much as 52.6% of internet content, while Afrikaans and Swahili account for just 0.003% and 0.00135%, respectively. Chinese, for example, accounts for only 1.8% of internet content despite having roughly 1.3 billion speakers. English can therefore be classified as an extremely high-resource language.
The compute divide also contributes to the rise of English dominance in LLMs and generative AI systems more broadly. The United States ranks among the top two global market shareholders at every stage of logic chip production, leading in chip design with a 52% share. Additionally, eight U.S.-based companies own and operate a staggering 490,000 (approx.) private H100 GPUs, with Meta alone holding 350,000. The most resourced company outside the U.S., Scaleway in France, operates 1,016. U.S.-based companies also hold a large number of private A100 GPUs, with four companies accounting for two-thirds of the global share. Given this divide in compute capacity, English models are able to thrive in ways that non-English models cannot, benefiting from far greater capacity for training and evaluation.
Data, models, and compute are all integral to training LLMs and creating foundation models, which can then be fine-tuned for specific tasks such as machine translation. The surplus of English-language data, coupled with the United States’ lead in model design and compute capacity, creates an environment in which it is difficult to develop non-English models with comparable capabilities and safety mitigations. There are, however, key actors working to bridge the gap and advance multilingualism, ranging from grassroots communities and startups to research labs and big tech. At the grassroots and startup level, organizations like Masakhane and Lelapa AI are working to support a higher concentration of African languages within AI development, supporting 23 and five languages, respectively. Similarly, regional research labs and events have worked to create and sustain cross-cultural datasets and dialogues focused on advancing non-English languages in AI, as AI4Bharat has done as a research lab in India and Khipu has in Latin America. Big tech companies have also begun to diversify the languages used across AI models, extending to broader constituencies.
Still, the lack of capacity and data for non-English languages limits cultural relevance and language-specific techniques in AI models and further fragments the market. Because the majority of models are built to support only Latin script and/or English linguistic features, many typologically diverse languages, and the cultural nuances they carry, are lost in translation when building out non-English systems. The fellows argued that regional, participatory, and synthetic dataset creation in low-resource and typologically diverse languages must therefore be prioritized to offset existing fragmentation. By engaging with culturally diverse stakeholders, LLM developers can create models that are conscious of local dialects and biases while adapting to non-English scripts, typology, and word order. The presenters offered several techniques suited to low-resource languages, including transfer learning, multilingual learning, data augmentation, language-specific tokenizers, human-in-the-loop practices, and less data-hungry adapters.
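Of the techniques listed, the adapter approach is perhaps the easiest to illustrate: a small bottleneck network is trained alongside a frozen pre-trained layer, so only a fraction of the parameters must be learned from scarce language data. The NumPy sketch below is illustrative only and not drawn from the discussion; the hidden and bottleneck sizes, initializations, and function names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def adapter(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: down-project, apply a nonlinearity,
    up-project, then add the result back to the frozen layer's
    output via a residual connection."""
    z = np.maximum(h @ W_down + b_down, 0.0)  # ReLU bottleneck
    return h + z @ W_up + b_up                # residual add

d, r = 768, 64                     # hidden size, bottleneck size (r << d)
W_down = rng.normal(0, 0.02, (d, r))
b_down = np.zeros(r)
W_up = np.zeros((r, d))            # zero init: adapter starts as identity
b_up = np.zeros(d)

h = rng.normal(size=(4, d))        # a batch of frozen-layer activations
out = adapter(h, W_down, b_down, W_up, b_up)

assert out.shape == h.shape        # shape is preserved
assert np.allclose(out, h)         # identity at initialization

# Trainable parameters per adapter vs. fully fine-tuning one dense layer:
adapter_params = d * r + r + r * d + d   # ~99k
full_params = d * d + d                  # ~590k
```

The point of the design is the last two lines: per layer, the adapter trains roughly a sixth of the parameters that full fine-tuning would, which is why adapters are described as less data-hungry.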
While designed to automate and excel at many language processing tasks, many early LLMs were not built for translation and are most capable of translating only high-resource languages. Traditional rule-based and statistical machine translation systems rely on handcrafted rules or word-level correlations between parallel texts and often produce imperfect or unnatural translations; they work reasonably well for high-resource languages, but as resources become scarcer, translation becomes less reliable. Newer neural machine translation models built on transformer architectures train in a multilingual capacity and, by learning correlations across subsets of language data, are better able to capture contemporary language patterns. Given this, presenters and program participants alike agreed that parameter-efficient adapter modules and language-specific tokenizers will be central to imagining the future of non-English AI technology and revamping existing models.
Adapter modules can be added to the layers of a pre-trained system, keeping the already-trained parameters frozen, to allow for continual learning in a multilingual capacity. Existing methods, like fine-tuning and feature-based transfer, change the parameters of or add to a pre-trained system, limiting how compact it can remain. Additional strategies, like language- and script-specific tokenizers, have also been proposed. Tokenizers transform a corpus of text into smaller units known as tokens; this is the first step in processing unstructured text data to train LLMs. The tokens are then encoded into numerical representations for downstream training. Current state-of-the-art tokenizers are designed to work effectively for languages written in the Latin script (e.g., English, Spanish) and are less capable for languages written in non-Latin scripts (e.g., Arabic, Japanese), affecting the multilingual performance of the resulting LLMs. As such, tokenizers designed to accommodate the unique structure and patterns of languages written in other scripts are expected to yield better outcomes.
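One way to see the tokenizer disparity concretely is to count tokens under a naive byte-level scheme, the fallback that many subword vocabularies reduce to when a script is poorly covered by their merge rules. The snippet below is an illustrative sketch, not a real production tokenizer; the sample phrases are the author's own.

```python
# Naive byte-level "tokenizer": one token per UTF-8 byte.
# Latin-script text costs 1 byte per character, while Arabic
# letters take 2 bytes and Japanese kana take 3, so the same
# five-character phrase consumes two to three times as many tokens.
samples = {
    "English":  "hello",       # Latin script
    "Arabic":   "مرحبا",       # non-Latin, 2 bytes per character
    "Japanese": "こんにちは",    # non-Latin, 3 bytes per character
}

for language, text in samples.items():
    n_chars = len(text)
    n_byte_tokens = len(text.encode("utf-8"))
    print(f"{language}: {n_chars} characters -> {n_byte_tokens} byte-level tokens")
```

A tokenizer whose vocabulary is trained on a given script merges those bytes back into far fewer tokens, which is the practical argument for the language- and script-specific tokenizers discussed above: fewer tokens per sentence means lower cost and a longer effective context for speakers of that language.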
Capacity building for low-resource language models and datasets will therefore require collaboration between big tech, governments, and local organizations or institutions to establish data collection, annotation, and labeling initiatives as well as educational and mentorship programs. By encouraging and uplifting work done in local forums, whether through funding or open-source tools, and by strengthening evaluation of multilingual models, the language and compute gaps will begin to narrow.
Emerging tech policies should also adhere to diversity, equity, and inclusion (DEI) standards, establishing initiatives that promote language fairness and support cultural differences and values. To successfully move beyond English, new AI systems need to preserve cultural histories, customs, and traditions in a manner that best conveys the thoughts and expressions of non-English users and helps to build a more equitable future. Smaller, more localized initiatives cannot do this alone, making collaboration across sectors and borders the key to sustaining linguistic diversity.
This summary captures the extensive discussion and insights shared during the meeting held on September 19, 2024, at 10:00am EST while maintaining the anonymity of individual participants as per the Chatham House Rule.
A special thank you to Isaac Halaszi, Research Assistant for the Strategic Foresight Hub at the Stimson Center, who contributed to this event summary.