The LLM LUCIE 7B and the LUCIE training dataset: lessons learned from training a real open-source AI model

In his talk at OW2con'25 in June 2025, Jean-Pierre LORRÉ, our Research Director, presented our committed approach to truly open-source AI. Through the development of our LUCIE 7B model, we aim to address crucial issues of technological sovereignty, cultural representation, and transparency.

Back to Basics: What is Open Source AI?

The current definition of open-source artificial intelligence revolves around four fundamental freedoms:

  1. Freely use the system,
  2. Study how it works,
  3. Modify its code,
  4. Share modified versions.

Applied to AI, this definition implies providing not only the source code but also the model weights, pre-training code, and, most importantly, the data used.

According to this definition, many so-called "open" models are not truly open source. Jean-Pierre LORRÉ cited Meta's LLaMA and Mistral's models as examples: although distributed with accessible weights, they neither publish their training datasets nor adopt truly permissive licenses. The result is a lack of transparency that hinders auditability, reproducibility, and free reuse.

LINAGORA's DNA: Promoting Free and Ethical AI

For over 25 years, we have established ourselves as a pillar of free software in France. Our involvement in artificial intelligence naturally stems from this culture:

"Our motivation is to foster truly open-source artificial intelligence."

This commitment has led to the creation of the OpenLLM community, a collaborative initiative bringing together researchers, companies, and enthusiasts around open language models. The LUCIE 7B model fits into this logic, with a clear commitment to respecting all open-source criteria.

LUCIE 7B: An Open, Transparent, and Sovereign Model

The LUCIE 7B project was designed for maximum transparency. All components necessary for reproducing the model are available (a loading sketch follows the list):

  • Model weights,
  • Pre-training code,
  • Description of the datasets used,
  • Tokenization model,
  • And complete documentation on Hugging Face.
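Because the weights, tokenizer, and documentation are published on Hugging Face, the model can be exercised with the standard `transformers` API. The sketch below is a minimal example; the repository id `OpenLLM-France/Lucie-7B` is our assumption of where the weights live, so check the OpenLLM-France organization page for the exact name.

```python
# Minimal sketch: load LUCIE 7B with Hugging Face `transformers`.
# The repository id below is an assumption; verify it on the
# OpenLLM-France organization page on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenLLM-France/Lucie-7B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short French completion to check the model end to end.
inputs = tokenizer("La souveraineté numérique est", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```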

Moreover, the approach emphasizes a strong presence of the French language (over 30% of the initial dataset), addressing another core motivation: countering the linguistic and cultural hegemony of predominantly English-language large models.

"A language is not just a tool: it conveys a culture, a history, a cuisine, a worldview."

Why This Matters

In a context of rapid growth of AI models, open source constitutes a democratic and ethical safeguard. Without access to training data, preprocessing methods, or alignment mechanisms, users have no way to understand, correct, or adapt the systems they use.

We aim to demonstrate that it is possible to produce powerful models that respect the principles of software freedom and are adapted to local contexts, while using the best available technologies (512 H100 GPUs on the Jean Zay supercomputer, Megatron/DeepSpeed, etc.).
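To give a sense of what running Megatron/DeepSpeed at that scale involves: the framework factors the GPU pool across tensor, pipeline, and data parallelism. The split below is purely illustrative; only the 512-GPU total comes from the talk, and the tensor/pipeline sizes are our assumptions, not LUCIE's published configuration.

```python
# Illustrative 3D-parallelism arithmetic for a 512-GPU cluster.
# Only the GPU total is from the source; TP and PP sizes are assumed.
TOTAL_GPUS = 512          # Jean Zay allocation cited above
TENSOR_PARALLEL = 4       # GPUs splitting each layer's matrices (assumed)
PIPELINE_PARALLEL = 4     # GPUs each holding a slice of layers (assumed)

# Data parallelism absorbs whatever remains after the model is sharded.
DATA_PARALLEL = TOTAL_GPUS // (TENSOR_PARALLEL * PIPELINE_PARALLEL)
assert TENSOR_PARALLEL * PIPELINE_PARALLEL * DATA_PARALLEL == TOTAL_GPUS

print(f"{DATA_PARALLEL} data-parallel replicas of a "
      f"{TENSOR_PARALLEL}x{PIPELINE_PARALLEL} model-shard grid")
# -> 32 data-parallel replicas of a 4x4 model-shard grid
```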

With LUCIE 7B, we want to prove that it is possible to develop ethical, open, sovereign, and high-performing AI. But this requires will, public resources (such as support from France 2030), and collective mobilization of the open-source community.

At a time when AI is shaping our societies, it is more necessary than ever to defend a technological model based on transparency, inclusion, and freedom.

