AYA, the ambitious multilingual AI project

A few weeks ago, we had the great privilege to host at Villa Good Tech, Sara Hooker, VP research of Cohere for AI, 1ʳᵉ enterprise AI company focused on data security by building forward-thinking AI models, with leading multilingual capabilities, designed to solve real-world business challenges.

Cohere for AI is pursuing ambitious projects aimed at making AI more accessible, efficient and representative of the world's linguistic diversity. Among its initiatives is the AYA project, which today mobilises thousands of researchers from all over the world to tackle complex technical challenges linked to multilingual modelling.

 

Cohere for AI: the multilingual AYA project, a collaborative effort on AI

 

AYA and the vision of multilingual AI

The AYA initiative was born out of a desire to transform multilingual machine learning. Unlike standard linguistic models, which are often dominated by the most widely spoken languages, AYA took on the colossal challenge of covering 101 languages, many of which are under-represented, such as Wolof, Shona and Hausa. The initiative has brought together 3,000 researchers from 119 countries, creating an unprecedented framework for collaboration that transcends linguistic and cultural boundaries.

Sara Hooker explained that the project aimed not only to develop large-scale models, but also to optimise the efficiency of AI in contexts where computing resources are limited. This has involved advanced techniques to optimise models of 103 billion parameters, while working on more efficient approaches to reduce latency and make AI accessible to a wider range of communities.

To illustrate her point and the importance of developing this type of model, Sara recalls a quotation from Ludwig WITTGENSTEIN :

" The limits of my language mean the limits of my world. "

Image
Image

 

The technical challenges of multilingual modelling

The development of multilingual AI models requires significant advances to overcome complex technical obstacles. Sara describes several major obstacles, including the scarcity of data in certain languages and the limitations of current tokenisation techniques.

For example, languages that use non-Latin scripts, such as Chinese and Hindi, are often penalised because standard tokenisation tools are not adapted to them, increasing the cost and complexity of processing them.

Another key challenge is the ‘double constraint of scarce resources’, where speakers of data-poor languages also have limited access to computational resources. Sara points out that "80% of languages have almost no text available, which demonstrates the importance of multimodal solutions, such as the use of audio to augment available data."

Sara HOOKER sums it up with a quote from Vukosi MARIVATE:

" When you are not taking part in the conversation, it takes place in your absence and not with you. "

 

An optimised approach for enhanced performance

What sets the AYA project apart is its innovative approach to multi-task learning and model alignment. Rather than training separate models for specific tasks, Cohere for AI has adopted a universal paradigm, training a single model on various tasks to promote more efficient knowledge transfer. This methodology reduces the amount of data required and improves performance on complex tasks.

Data optimisation was also crucial. AYA introduced a rigorous data cleaning and quality assessment process, with peer-reviewed contributions to ensure their relevance. Researchers developed quality indicators and even removed unnecessary data to improve the overall performance of the models.

 

Image
Image

 

Worldwide, open collaboration

At the heart of Cohere for AI's mission is a commitment to transparency and collaboration. Sara proudly mentioned that unlike many industrial research labs that keep their work under wraps, Cohere for AI publishes its research widely. This includes publishing model weights and training datasets, an essential step in allowing other researchers and institutions to exploit these resources.

The AYA initiative also set important precedents in terms of language coverage. Prior to AYA, few models declared the languages covered, limiting their usefulness in many local contexts. By making their work public, Cohere for AI hopes to encourage greater consideration of linguistic diversity in AI.

 

Image
Image

 

What are the next steps?

Looking to the future, Cohere for AI plans to develop specialised language models by geographic region, to maximise efficiency and solve tokenisation problems. Plans for models focused onEurope andAsia are in the pipeline, aiming to leverage the benefits of cross-linguistic transfer while increasing the available capacity for each language family.

The work of Cohere for AI and Sara Hooker demonstrates an ambitious and inclusive vision of AI. The AYA project is an important step towards a more equitable AI, where every language has the opportunity to be represented and integrated in tomorrow's digital world.

 

 

How can I help you?

CAPTCHA
1 + 3 =
Solve this simple math problem and enter the result. E.g. for 1+3, enter 4.
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.