RAI Session: Sourcing Training Data for Generative AI Systems

Past Event

A discussion exploring innovative solutions to ensure ethical data sourcing

Building or modifying generative AI systems that account for varying cultures and languages in the Global South requires large volumes of training data. However, many countries and regions within the Global South possess limited data given the comparative lack of computing resources, technological skilling, internet availability, and localized data in non-English languages. Efforts to bridge the gap between English and non-English systems and increase data collection with community-driven and regional initiatives include Karya, Masakhane, Lacuna Fund, and Khipu. However, varied internet penetration rates across the thousands of native languages spoken in the Global South complicates the creation of non-English datasets. While internet and digital data are suitable and widely used in English datasets, non-English speakers will likely benefit from more grassroots and participatory methods of gathering data or—eventually—synthetic data creation. Cultural sensitivity and localization will be key for creating diverse datasets that benefit underrepresented linguistic and cultural communities, ensuring that they are not left behind as AI systems increasingly become integral to how we interact with digital systems.

On March 6, 2025, fellows participated in a discussion on the topic of Sourcing Training Data for Generative AI Systems. To seed the conversation, four fellows presented case studies and research findings that highlighted existing mechanisms for non-English dataset creation in their respective regions and the interdisciplinary roles of education, government, public and private sectors in their development.

Discussion Summary

The fourth discussion session focused on building non-English datasets for future generative AI models and systems. There are over 7,000 languages spoken worldwide, each with its own cultural nuances and sensitivities. AI systems trained primarily on English data risk cultural homogenization, reinforce biases and stereotypes through a lack of diversity, and may see little use among the global majority that does not speak English. English makes up just under 50 percent of all web content, with the next closest languages accounting for five to six percent each. Given the ongoing dependence on large volumes of web content to train advanced generative AI models, the scarcity of widely available non-English data on the web contributes to varying degrees of low performance of generative AI models in those languages. AI developers across the Global South are working to close this gap in several ways, ranging from crowdsourced data to synthetic data creation.

Existing data collection methods include the use of publicly available datasets, crowdsourced and internal data (in the private sector), partnerships with academic institutions, and synthetic data creation. In each of the regions presented by the fellows, the preservation of cultural and linguistic diversity is fundamental to the creation of representative datasets. Contemporary AI and voice technologies tend to misrepresent most global languages, especially those spoken by BIPOC and gender-diverse communities. Misrepresentation in these technologies ranges from a complete lack of resources in a given language to errors in identifying the speech patterns of local accents and dialects, often resulting in inaccurate or unhelpful responses. Identifying inclusive and ethical practices for data collection will not only strengthen technological sovereignty in these communities but could also be an avenue for data ownership. Current standards for data collection are minimal and seldom enforced, and a diverse multistakeholder approach is needed to increase standards’ efficacy and uptake.

Academia, government, and the private sector can work to foster equity and increase representation in dataset creation. Academia already produces a large pool of data, so public access to datasets created and published by academic institutions is essential. Academia can also serve as an example for independent researchers seeking to source training data by providing inclusive and responsible methodologies for collecting high-quality training and evaluation data. One suggestion is the implementation of an ethical review board for new and emerging datasets, similar to the review boards that already hold academic research accountable. Governments can similarly regulate and enforce fairness standards, increase public awareness, and establish ethical guidelines through data protection laws and other legal frameworks. When building these guidelines, however, it is important to incorporate the priorities of underrepresented and vulnerable groups of people. Private sector organizations and companies can bring a wealth of resources into play, helping to fund community-led initiatives and providing platforms for data housing and collection.

Latin America

The Latin America region hosts a diverse array of AI initiatives across sectors that address ethical data sourcing. For example, the intergovernmental Roadmap for Ethical Artificial Intelligence for Latin America and the Caribbean 2024-2025 has five main lines of action: Governance and Regulation; Skills and the Future of Work; Protection of Vulnerable Groups; Environment, Sustainability and Climate Change; and Infrastructure. These five lines of action seek to ensure the ethical development and deployment of future AI technologies across the region. Latam-GPT, an LLM developed by the Chilean Ministry of Science, Technology, Knowledge, and Innovation (CTCI) and CENIA—and supported by multiple Latin American countries and institutions—is expected to launch in September 2025. The open-source model focuses on preserving the linguistic and cultural diversity of Latin America and encourages public use to further refine regional dialects. It is the first LLM of its size to be developed and deployed fully in the Spanish language.

The National University of Córdoba (Universidad Nacional de Córdoba) in Argentina partners with Latam-GPT on its alignment mechanism, driving value alignment with participatory datasets. In this instance, academia plays a pivotal role in data collection. Together with EDIA, a project of the Vía Libre Foundation, the University implemented a five-month-long course that engaged high school teachers and their students in the creation of a dataset representing Córdoba’s values. The course reached over 250 teachers and 5,000 students, who contributed 45,000 sentences toward a localized dataset. Students were given a prompt to complete, which was then used to inform components of automatic language processing technologies, helping them to detect stereotypical information. The dataset was representative of many interlocking identities, from geography and socioeconomic status to gender and politics. One of the key principles was the ability for participants to opt into the collection process and consent to having their data used when building both Latam-GPT and other local initiatives. Throughout the data collection process, organizers were transparent in communicating the goals of the dataset and its potential gains for those providing personal data.

Africa

In Africa, Mozilla’s Common Voice has become the world’s largest multi-language crowdsourced dataset available, collecting more than 30,000 hours of speech in 115 languages. Globally, technologists use this data to create new AI and voice technologies. Like Latam-GPT, Common Voice grantees encourage public use to continually build upon the data available. Because it is participatory by design, Common Voice relies on communities to input their own language data, a consent-driven approach which has brought voice technology capabilities to Kinyarwanda and Kiswahili speakers.

Indonesia

Similar to the much-anticipated Latam-GPT, Sahabat-AI is the first LLM in Bahasa Indonesia developed by Indonesians to promote linguistic diversity. The model was developed as an open-source model in a collaboration between Indosat, a telecom company, and GoTo, a transportation and e-commerce company. The model supports the more than 700 regional languages and dialects spoken in Indonesia, reducing the digital divide and accurately representing linguistic sensitivities. Sahabat-AI serves many functions. The chatbot function can aid Indonesian citizens with questions on their national identity cards, local and national taxes, and name and address changes. The voice assistant function connects to GoPay and Gojek, popular e-wallet and ridesharing applications in the country, and allows citizens to access these commonly used applications with voice commands. When building Sahabat-AI, the two companies utilized both synthetic and publicly available instruction-tuning datasets that were curated by a team of native speakers. The original instruction-tuning dataset that the model was trained on was in English, but through partnerships with national universities, linguists, and native speakers, the team was able to localize and translate the dataset into diverse target languages. The model now has routine evaluation processes in both Bahasa Indonesia and English.

Conclusion

The Fellows’ case studies point to the conclusion that, in order to sustain the creation of non-English datasets, a holistic approach is needed that incorporates existing data collection methods and focuses on engaging diverse stakeholder groups. Current training datasets built from internet scraping not only overrepresent the English language and the cultural norms and values of the Global North, but are also often criticized for violating global copyright laws. By realigning to a more community-driven approach to data collection, future AI models trained on these datasets will not only be more globally representative and uphold the linguistic integrity of the communities they serve, but will also have a higher chance of being adopted more broadly around the world. Intentional partnerships between academia, government, and the private sector will help support a community-driven approach.

This summary captures the extensive discussion and insights shared during the meeting held on March 6, 2025, at 10:00am EST while maintaining the anonymity of individual participants as per the Chatham House Rule.

A special thank you to Isaac Halaszi, Research Assistant for the Strategic Foresight Hub at the Stimson Center, who contributed to this event summary.