The LLM LUCIE 7B and the LUCIE training dataset: lessons learned from training a real open-source AI model

In his talk at OW2con'25 in June 2025, Jean-Pierre LORRÉ, our Research Director, presented our committed approach to truly open-source AI. Through the development of our LUCIE 7B model, we aim to address crucial issues of technological sovereignty, cultural representation, and transparency.

Back to Basics: What is Open Source AI?

The current definition of open-source artificial intelligence revolves around four fundamental freedoms:

  1. Freely use the system,
  2. Study how it works,
  3. Modify its code,
  4. Share modified versions.

Applied to AI, this definition implies providing not only the source code but also the model weights, pre-training code, and, most importantly, the data used.

According to this definition, many so-called "open" models are not truly open source. Jean-Pierre LORRÉ cited Meta's LLaMA and Mistral's models as examples: although distributed with accessible weights, they neither publish their training datasets nor adopt truly permissive licenses. The result is a lack of transparency that hinders auditability, reproducibility, and free reuse.

LINAGORA's DNA: Promoting Free and Ethical AI

For over 25 years, we have established ourselves as a pillar of free software in France. Our involvement in artificial intelligence naturally stems from this culture:

"Our motivation is to foster truly open-source artificial intelligence."

This commitment has led to the creation of the OpenLLM community, a collaborative initiative bringing together researchers, companies, and enthusiasts around open language models. The LUCIE 7B model fits into this logic, with a clear commitment to respecting all open-source criteria.

LUCIE 7B: An Open, Transparent, and Sovereign Model

The LUCIE 7B project was designed for maximum transparency. All components necessary for reproducing the model are available (a loading sketch follows the list):

  • Model weights,
  • Pre-training code,
  • Description of the datasets used,
  • Tokenization model,
  • And complete documentation on Hugging Face.
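Because the weights, tokenizer, and documentation are published on Hugging Face, the model can be exercised with the standard `transformers` API. The sketch below is a minimal example; the repository id `OpenLLM-France/Lucie-7B` is our assumption of where the weights live, so check the OpenLLM-France organization page for the exact name.

```python
# Minimal sketch: load LUCIE 7B with Hugging Face `transformers`.
# The repository id below is an assumption; verify it on the
# OpenLLM-France organization page on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenLLM-France/Lucie-7B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short French completion to check the model end to end.
inputs = tokenizer("La souveraineté numérique est", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```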

Moreover, the approach emphasizes a strong presence of the French language (over 30% of the initial dataset), addressing another core motivation: countering the linguistic and cultural hegemony of predominantly English-language large models.

"A language is not just a tool: it conveys a culture, a history, a cuisine, a worldview."

Why This Matters

In a context of rapid growth of AI models, open source constitutes a democratic and ethical safeguard. Without access to training data, preprocessing methods, or alignment mechanisms, users have no way to understand, correct, or adapt the systems they use.

We aim to demonstrate that it is possible to produce powerful models that respect the principles of software freedom and are adapted to local contexts, while using the best available technologies (512 H100 GPUs on the Jean Zay supercomputer, Megatron/DeepSpeed, etc.).
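To give a sense of what running Megatron/DeepSpeed at that scale involves: the framework factors the GPU pool across tensor, pipeline, and data parallelism. The split below is purely illustrative; only the 512-GPU total comes from the talk, and the tensor/pipeline sizes are our assumptions, not LUCIE's published configuration.

```python
# Illustrative 3D-parallelism arithmetic for a 512-GPU cluster.
# Only the GPU total is from the source; TP and PP sizes are assumed.
TOTAL_GPUS = 512          # Jean Zay allocation cited above
TENSOR_PARALLEL = 4       # GPUs splitting each layer's matrices (assumed)
PIPELINE_PARALLEL = 4     # GPUs each holding a slice of layers (assumed)

# Data parallelism absorbs whatever remains after the model is sharded.
DATA_PARALLEL = TOTAL_GPUS // (TENSOR_PARALLEL * PIPELINE_PARALLEL)
assert TENSOR_PARALLEL * PIPELINE_PARALLEL * DATA_PARALLEL == TOTAL_GPUS

print(f"{DATA_PARALLEL} data-parallel replicas of a "
      f"{TENSOR_PARALLEL}x{PIPELINE_PARALLEL} model-shard grid")
# -> 32 data-parallel replicas of a 4x4 model-shard grid
```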

With LUCIE 7B, we want to prove that it is possible to develop ethical, open, sovereign, and high-performing AI. But this requires will, public resources (such as support from France 2030), and collective mobilization of the open-source community.

At a time when AI is shaping our societies, it is more necessary than ever to defend a technological model based on transparency, inclusion, and freedom.

