CPDP 2024 - European Open Source generative AI: progress and challenges

At the Computers, Privacy and Data Protection conference (CPDP), Michel-Marie MAUDET, Managing Director and co-founder of LINAGORA, presented LINAGORA's activities, and in particular the various initiatives recently launched in the field of Artificial Intelligence. Michel-Marie MAUDET is now one of the leading thinkers and practitioners in the Open Source ecosystem. His insights, opinions and reflections are regularly sought at the highest level, and on this occasion at international level, in Brussels.

OpenLLM Europe, a major player in Open Source AI

Michel-Marie MAUDET began his talk by introducing the OpenLLM-France community, which was extended to OpenLLM-Europe in February 2024. This initiative aims to build digital alternatives around a community that is passionate about Open Source generative AI. Today, it has more than 750 members, bringing together industry players, academics and public organisations. This synergy of skills is a crucial driver for innovation and technological independence.

"We firmly believe that a truly Open Source model for AI is essential. It's a question of biodiversity! As in nature, biodiversity ensures the resilience and health of the ecosystem."

In the field of artificial intelligence, it is essential to promote a truly Open Source model, because digital biodiversity is the key to ensuring the resilience and health of the ecosystem. Digital diversity encourages innovation, equity and inclusivity, and makes this technology accessible to everyone.

It was in this spirit that Michel-Marie launched the OpenLLM initiative, which today has two key objectives. Firstly, the construction of digital commons in the field of generative AI: assets used by all, without restriction, to encourage innovation and the development of new technologies. The second objective is to build an Open Source community to share and pool efforts in building these commons.

By opening up sources and sharing knowledge, it is possible to create a strong and dynamic community that can respond to the future challenges facing our societies. The members of the OpenLLM community are convinced that this is the only way to ensure fair and responsible use of artificial intelligence. Michel-Marie in particular expressed his pride in the progress of the work. The community is now ready to take things to the next level: developing a brand new 100% Open Source LLM, LUCIE.

LUCIE: the new model for the OpenLLM community

To train a new model, you first need to create what is known as a tokenizer, the component that splits text into the units (tokens) that the model processes. Michel-Marie MAUDET therefore stressed the importance of creating a European tokenizer to avoid additional costs. A model like ChatGPT today is mainly driven by English. This means that if you make a request in Spanish, French or German, you pay an additional cost of around 30%, because the model is not optimised for these languages. By creating a European tokenizer, it is possible to adapt models to the language environment, so that speakers of these languages benefit from the same quality of service as other users.
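The link between tokenizer and cost can be illustrated with a toy example. The sketch below is not how production tokenizers work (they typically use learned subword vocabularies); it is a minimal greedy longest-match tokenizer, and the vocabularies, sentences and function names are all hypothetical. It shows that text in a language poorly covered by the vocabulary splits into more tokens, and since usage is generally counted in tokens, more tokens means a higher cost.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer; unknown text falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry matches: emit one character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


def extra_cost(baseline_tokens, other_tokens):
    """Relative overhead of `other_tokens` compared with `baseline_tokens`."""
    return (other_tokens - baseline_tokens) / baseline_tokens


# Hypothetical vocabularies: one covers only English, the other also French.
english_vocab = {"the", " cat", " sleeps"}
european_vocab = english_vocab | {"le", " chat", " dort"}

en = greedy_tokenize("the cat sleeps", english_vocab)
fr_poor = greedy_tokenize("le chat dort", english_vocab)
fr_good = greedy_tokenize("le chat dort", european_vocab)

print(len(en), len(fr_poor), len(fr_good))

# With illustrative counts of 100 tokens (English) and 130 tokens (another
# language) for the same content, the overhead is the 30% cited in the talk.
print(f"{extra_cost(100, 130):.0%}")
```

The toy vocabulary exaggerates the gap on purpose. The point is only the mechanism: a vocabulary built with European languages in mind brings their token counts back in line with English.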

But training this type of model requires colossal human and financial resources:

"Computational resources are very expensive and very important. For CLAIRE, it wasn't 200,000 hours, but now with our dataset, we're close to 1 to 2 billion GPU hours. To give you an idea, we are training and starting to test configurations with 256 GPUs, so it takes more than 200 days to train the model."
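The configuration mentioned in the quote can be put into perspective with simple arithmetic: GPU hours are the number of GPUs multiplied by the wall-clock hours they run. The sketch below is illustrative only. The function names are hypothetical, and it assumes every GPU runs continuously for the whole period.

```python
HOURS_PER_DAY = 24


def gpu_hours(num_gpus, days):
    """Total GPU hours consumed by `num_gpus` running non-stop for `days`."""
    return num_gpus * days * HOURS_PER_DAY


def days_needed(total_gpu_hours, num_gpus):
    """Wall-clock days needed to spend `total_gpu_hours` on `num_gpus`."""
    return total_gpu_hours / (num_gpus * HOURS_PER_DAY)


# Configuration cited in the talk: 256 GPUs for about 200 days.
total = gpu_hours(256, 200)
print(f"{total:,} GPU hours")

# The same budget on a hypothetical cluster twice the size.
print(days_needed(total, 512), "days")
```

Under these assumptions, 256 GPUs running for 200 days amount to roughly 1.2 million GPU hours, and doubling the cluster size halves the wall-clock time for the same budget.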

Support from the French government: invaluable assistance for AI projects

Michel-Marie also took the opportunity to welcome the recent support for open source AI developments from the French government. Indeed, just the day before, President Emmanuel Macron had announced the winners of a France 2030 call for projects to build digital commons in generative artificial intelligence. LINAGORA is a France 2030 winner.

"I'm delighted to announce that we now have the support of the French government."

The OpenLLM initiative has been selected as one of the winners of this programme. This means additional financial support for the next two years. The aim is to continue work on the creation of a tokenizer for European languages, thereby encouraging the pooling of efforts at European level. Michel-Marie MAUDET expressed his satisfaction at this support, underlining the importance of strengthened cooperation to compete with technology giants such as the GAFAM.

Regulatory and political issues

As you can see, we need to pool our resources and join forces around the various national and European initiatives to develop AI. Beyond political decisions and regulations, it is crucial to start working on concrete projects that can have a real impact on society. That's why the development of the tokenizer for European languages has already begun, despite the uncertainty of future regulations.

Speed, responsiveness and innovation are essential to progress in this area, and Open Source technologies can play a key role in this process. Indeed, there are already examples of regulation and policy being influenced by communities. If Open Source technologies are shown not to be a brake on AI, they could have a crucial impact on a future version of the AI Act. To achieve this, LINAGORA's teams are working closely with researchers and public research teams, such as the CNRS.

"Open source is an excellent way of combining efforts between research communities and industry players like us."

Michel-Marie MAUDET's presentation at CPDP 2024 highlighted the key role that Open Source can play in the field of artificial intelligence. Initiatives such as OpenLLM, and models such as LUCIE, demonstrate that collaboration and institutional support are essential to building inclusive, innovative and resilient technologies. With joint efforts and a clear vision, Europe can position itself as a leader in the global Open Source AI landscape.
