OpenLLM France: Open and transparent AI with a French twist

Article written by Julie Hunter, Senior Researcher in Linguistics and Natural Language Processing (NLP), for LINAGORA’s R&D department.

While the R&D team has multiple projects underway (more on us in another post!), we’d like to focus on the one that’s keeping us the busiest right now: OpenLLM France. We’ll have some new language models coming out in the next few weeks, so we thought we would take this opportunity to remind you of what this project is all about.
 

A little history  

When the LLM (large language model) craze hit at the end of 2022, we suddenly had at our disposal high-quality LLMs for speech and text, opening up a new world of possibilities highly relevant to LINAGORA’s use cases, such as meeting summarization and document querying. It was clear, however, that these new tools harbored dark secrets. They were trained by scraping any bit of the web their creators could reach, regardless of intellectual property rights or other concerns about the scraped data, such as extreme toxicity. What’s more, they were trained on heavily disproportionate amounts of English data, bringing along the biases carried by anglocentric content. Perhaps worst of all, these models, and the data used to train them, were entirely closed, so there was no immediate way to figure out how to build our own models in line with LINAGORA’s values.

In this context, LINAGORA recognized a need to develop truly open-source models that did not have to hide their training data and that focused on the French language, in order to better represent French-speaking cultures and to stay true to the open-source values of the company. We rallied actors from across France to our cause, and the OpenLLM France community was born in 2023. In September 2024, we started the official OpenLLM France project, a two-year grant funded by BPI France.

Through this project, and the collaborations it affords, LINAGORA has emerged as a unique player in Europe and worldwide, capable of developing truly open-source language models dedicated to multilingualism, with a particular focus on the French-speaking world.
 

The ambition of OpenLLM France 

The OpenLLM France project aims to contribute digital commons (open data, models, and code) to help ensure that AI development in France, and in Europe more generally, can be self-sufficient.

In more detail, our objectives are to:

  • develop and improve French training corpora in order to mitigate anglocentric bias in training,  
  • publish our datasets under open licenses, in the exact form in which they were used for training,
  • share our model weights under open licenses, including not only the final model weights but also those from intermediate training steps,
  • publish the code used for data preparation, model training and evaluation under open licenses. 


Our commitment to open data brings with it a commitment to respect the intellectual property of data creators to the extent that we can, in conformity with European directives. This, together with the choice to include high proportions of French data in our training sets, means that we deprive ourselves of large amounts of high-quality English data used by closed models. Nevertheless, we believe this downside is outweighed by the long-term benefits of openly sharing data.

Not only does our French-centered and open-source approach bring our model training more in line with LINAGORA’s values and many of our use cases, but it promotes French and European sovereignty by giving Europeans the keys to develop their own models. Intermediate model weights, for instance, allow model developers to redo the final stages of a training run rather than starting from scratch, and they allow researchers to better understand how LLMs acquire different capacities throughout the training process. Open data and code can likewise keep other researchers and developers from reinventing the wheel, allowing them to move on to new problems.
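To make the point about intermediate weights concrete, here is a toy sketch (illustrative only, not our actual training code): a tiny gradient-descent loop where a saved intermediate checkpoint lets a second developer redo only the second half of training and land on exactly the same model as a full run from scratch.

```python
# Toy illustration of why published intermediate checkpoints matter:
# anyone can resume from a checkpoint instead of retraining from scratch.

def train(w, data, steps, lr=0.1):
    """Gradient descent on a 1-D least-squares problem, fitting y = w * x."""
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
        w -= lr * grad
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true weight is w = 2

# A full training run: 100 steps starting from scratch.
w_full = train(0.0, data, steps=100)

# An "intermediate checkpoint" released after 50 steps...
w_checkpoint = train(0.0, data, steps=50)

# ...lets anyone else redo only the remaining 50 steps.
w_resumed = train(w_checkpoint, data, steps=50)

assert abs(w_resumed - w_full) < 1e-9  # same result, half the compute
```

For real LLMs the checkpoint also carries optimizer state and data-loader position, but the principle is the same: the released weights are the resume point.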
 

Furthering research on LLMs

Within the OpenLLM project, our main research themes include:

  • Multilinguality: From education to health, AI systems need to master the intricacies of the languages spoken in the countries in which they are used. Our research focuses on how best to train and evaluate models for use cases requiring expertise in languages other than English.
  • Multimodality: AI systems that exploit information from multiple modalities, such as voice or vision, are needed in a variety of scenarios. While working actively on voice-text versions of OpenLLM models, we are also exploring more advanced approaches to conceiving of multimodal conversational assistants with our academic partners.  
  • Education: One of the major objectives of our project is to improve AI use in the domain of education. This implies working with teachers to develop models that support their needs and those of their students, but also collaborating with experts to raise awareness of the risks tied to AI use in education.

 

Fortunately, we are not alone in this adventure but work together with numerous actors in industry and research, including:

  • CEA List: contributing to the creation of new datasets for French
  • LORIA: studying best practices for pretraining and post-training
  • IDRIS: guiding us through model training problems and debugging
  • OpSci: helping ensure that our models respond in accordance with human preferences
  • LIX: furthering the development of evaluation methods specific to French
  • Class’Code: coordinating interaction with EdTechs and academies in France
  • CEA and la Sorbonne: analyzing the ethical and legal impact of our project and LLM development more generally
  • Mens Data: quantifying the environmental impact of our model training
  • TALK’R: supporting dataset production through tasks such as web scraping

We plan to start publishing our data and models in the next few weeks, so stay tuned! You’ll be hearing a bit more from us very soon…