Sovereign AI for Vietnam: Linagora’s Open-Source Journey

Building a Sovereign Vietnamese AI

In October 2025, during an AI-focused event in Ninh Binh, Alexandre Zapolsky, President of LINAGORA, took part in a series of discussions that would help shape our AI strategy in Vietnam. Among them, an exchange with Ho Duc Thang, a contributor to AI laws in Vietnam, highlighted a key question: how can Vietnam build sovereign AI solutions while relying on open and transparent technologies?

Vietnam is entering a decisive phase in defining its AI ecosystem. Through our discussions with governmental stakeholders, including teams within the Ministry of Science and Technology (MOST), it has become clear that developing Vietnamese-language AI models is not only a technical challenge, but also a strategic priority. Today, the global AI landscape is largely dominated by major American and Chinese companies, setting the standards for performance and adoption. While these solutions offer strong capabilities, relying exclusively on them could lead to increasing strategic and technological dependency, particularly in sensitive domains such as public services, data governance, and enterprise knowledge management. In this context, fostering local capabilities and sovereign AI alternatives is essential to ensure long-term autonomy and control. MOST has been actively organizing benchmarking initiatives to evaluate the performance of large language models (LLMs) in Vietnamese, creating a dynamic and competitive environment aligned with the country’s broader push for open and sovereign AI. At Linagora Vietnam, we see this as an opportunity to both contribute to and learn from this national effort.

Our ambition aligns naturally with this vision. True to LINAGORA's open-source DNA, we bring concrete experience in developing open-source large language models through the OpenLLM France community and through Lucie, our sovereign open-source LLM developed in France. Building on this experience, we aim to develop a strong Vietnamese LLM, not only to ensure linguistic accuracy, but also to capture cultural and contextual nuances that global models often overlook. More broadly, we believe that competing with dominant American and Chinese AI ecosystems requires going beyond isolated, language-specific projects and instead combining knowledge, resources, and experience across a global open-source community.

This initiative is closely tied to our product strategy. We are actively integrating AI-powered assistance into our collaboration suite, Twake Workplace, with a focus on email and document management features. While we do not currently have active deployments in Vietnam, the availability of a performant Vietnamese LLM could become a strong differentiator to enter this market. By enabling high-quality localization and AI-powered features adapted to Vietnamese users, it would significantly reinforce the relevance and competitiveness of Twake Workplace in the local ecosystem.

Finally, we have observed a strong and growing interest among Vietnamese tech communities in AI development and sovereign AI approaches. This was particularly evident during our Open Tech Talk organized in January 2026, where we presented our AI initiatives, including OpenRAG, to students from several universities. The level of engagement and curiosity confirmed the relevance of our approach and the demand for open, locally driven AI solutions. Building on this momentum, we launched an internship program focused on benchmarking and training Vietnamese language models. We have since onboarded our first intern on this topic, who is now working closely with the OpenRAG team led by Andrzej Neugebauer from LINAGORA France. This collaboration illustrates our commitment to fostering local talent while contributing to a broader, international open-source AI ecosystem.

 

Vietnamese LLM Benchmark

Following our ambition to develop impactful AI projects in Vietnam, as outlined above, we initiated the Vietnamese LLM Benchmark project. The primary aim of this project is to investigate the current landscape of open-source large language models (LLMs) for the Vietnamese language, establish a clear performance baseline, and identify the most suitable model for integration into our OpenRAG initiatives.

Historically, developers and researchers have faced a notable lack of comprehensive, standardized evaluation suites tailored specifically for Vietnamese Natural Language Processing (NLP) tasks. To address this gap and understand what current open-source models are truly capable of, we developed the Vietnamese LLM Benchmark. Additionally, a key objective of this report is to facilitate the selection of an optimal Vietnamese LLM for deployment within the OpenRAG project.

 

Models

To thoroughly assess the current ecosystem, the benchmark evaluates three open-weight models: Qwen3.5-9B, Qwen3-8B and Unicorn-VL-R3 (a Vietnamese fine-tuned model).

Dataset

The evaluation spans four datasets, each designed to test different cognitive and generative capabilities:

VMLU

A multiple-choice benchmark with 744 questions covering a wide range of knowledge and reasoning difficulty levels.

UIT-ViSquAD2.0

Contains 1,000 question-answer pairs from 174 Vietnamese Wikipedia articles. It evaluates long-context understanding and includes 10% unanswerable questions to test hallucination avoidance.

Vietnamese Multiple Document Summarization Dataset (ViM)

Consists of 100 news clusters requiring abstractive summarization across multiple documents, testing coherence and synthesis.

Vietnamese Instruct General Dataset (VTSNLP)

A large dataset with 4.5 million samples covering tasks such as summarization, translation, inference, and content generation.

Due to hardware constraints, only about 10% of each dataset was used.
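When only a fraction of each dataset can be evaluated, the subset should be drawn the same way for every model so that scores stay comparable. A minimal sketch of such reproducible subsampling is shown below; the function name, fraction, and seed are illustrative assumptions, not the benchmark's actual configuration.

```python
import random

def sample_subset(dataset, fraction=0.10, seed=42):
    """Draw a reproducible ~10% subset of a dataset.

    A fixed seed guarantees every model is evaluated on exactly the same
    questions. The fraction and seed values here are illustrative.
    """
    k = max(1, int(len(dataset) * fraction))
    return random.Random(seed).sample(dataset, k)

questions = list(range(1000))       # stand-in for 1,000 benchmark items
subset = sample_subset(questions)   # same 100 items on every run
```

Fixing the seed also makes reruns directly comparable across time, which matters when results are later extended to the full datasets.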

 

Method

We designed a memory-efficient pipeline that sequentially loads each model to generate predictions. This approach ensures that the entire benchmark can run on a single, highly accessible GPU, such as a Kaggle T4 GPU.
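The core idea of the pipeline can be sketched as a loop that loads one model, runs all prompts through it, and frees its memory before touching the next model. The sketch below is an illustration under that assumption, not Linagora's actual code; `EchoModel` is a hypothetical stand-in for a real model wrapper.

```python
import gc

def run_benchmark_sequentially(model_loaders, prompts):
    """Evaluate several models one at a time, so a single GPU suffices.

    `model_loaders` maps a model name to a zero-argument callable that
    loads and returns that model.
    """
    results = {}
    for name, load in model_loaders.items():
        model = load()                  # load this model onto the GPU
        results[name] = [model.generate(p) for p in prompts]
        del model                       # release it before the next model
        gc.collect()                    # with PyTorch, one would also call
                                        # torch.cuda.empty_cache() here
    return results

class EchoModel:
    """Hypothetical stand-in for a model wrapper exposing .generate()."""
    def generate(self, prompt):
        return "answer: " + prompt

outputs = run_benchmark_sequentially(
    {"qwen3-8b": EchoModel, "unicorn-vl-r3": EchoModel},
    ["Thủ đô của Việt Nam là gì?"],
)
```

Because only one model is resident at a time, peak memory is bounded by the largest single model rather than the sum of all three.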

Each question was converted into a standalone prompt to simulate real-world usage and ensure zero-shot evaluation. This prevents context leakage between samples and ensures consistent testing conditions.
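For a multiple-choice task like VMLU, this means each item is rendered as its own self-contained prompt with no shared chat history. The template below is a plausible illustration of that formatting, not the exact prompt used in the benchmark.

```python
def build_vmlu_prompt(question, choices):
    """Format one multiple-choice item as an independent zero-shot prompt.

    Each call produces a standalone prompt, so no context can leak
    between samples. The wording is an illustrative assumption.
    """
    options = "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", choices)
    )
    return (
        "Answer the following multiple-choice question with a single letter.\n"
        f"Question: {question}\n{options}\nAnswer:"
    )

prompt = build_vmlu_prompt(
    "Thủ đô của Việt Nam là thành phố nào?",
    ["Hà Nội", "Đà Nẵng", "Huế", "Cần Thơ"],
)
```

Keeping every sample independent in this way is what makes the evaluation genuinely zero-shot.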

Performance was measured using standard metrics including Accuracy, Exact Match (EM), F1, and ROUGE-L. Additionally, for open-ended tasks, we used an “LLM-as-judge” approach. SeaLLMs-v3-7B-Chat scored outputs on a scale from 1 to 10 based on criteria such as accuracy, faithfulness, and coherence.
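Exact Match and token-level F1 are typically computed on normalized text, in the style of the SQuAD evaluation script. A minimal sketch of that computation follows, assuming simple ASCII-punctuation stripping (Vietnamese diacritics are preserved); it is an illustration, not the benchmark's exact scoring code.

```python
import string
from collections import Counter

def normalize(text):
    # Lowercase, strip ASCII punctuation, collapse whitespace (SQuAD-style).
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction, reference):
    # 1 if the normalized strings are identical, else 0.
    return int(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    # Token-overlap F1 between normalized prediction and reference.
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

F1 rewards partial overlap, which is why it sits well above EM in the results: a prediction like "thủ đô hà nội" against the reference "hà nội" fails EM but still earns substantial F1 credit.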

 

Results

Dataset     | Metric       | Qwen3.5-9B | Qwen3-8B | Unicorn-VL-R3
VMLU        | Accuracy     | 75.91%     | 66.67%   | 67.07%
ViSquAD2.0  | F1           | 75.42%     | 50.75%   | 67.73%
ViSquAD2.0  | EM           | 48.90%     | 6.90%    | 39.20%
ViM         | ROUGE-L      | 46.69      | 45.73    | 50.73
ViM         | LLM-as-judge | 7.48       | 7.62     | 7.59
VTSNLP      | LLM-as-judge | 7.57       | 7.52     | 7.53

The benchmark results show that larger, more recent models still hold a distinct advantage in raw knowledge retrieval and exactness: Qwen3.5-9B secured both the highest VMLU accuracy (75.91%) and the highest ViSquAD2.0 F1 score (75.42%).

However, the Unicorn-VL-R3 model proves that fine-tuning can rapidly close the performance gap. Unicorn-VL-R3 significantly outperformed the baseline Qwen3-8B model in reading comprehension. Crucially, it achieved the benchmark's highest ROUGE-L score of 50.73 on the complex multi-document summarization task (ViM). This suggests that fine-tuning vastly improved its ability to synthesize, rephrase, and organize disparate pieces of information.

When evaluated by the LLM judge, all three models achieved remarkably similar scores, clustering tightly between 7.48 and 7.62. This indicates that while exact factual retrieval scales with model size, the foundational ability to generate coherent, natural, and relevant Vietnamese text is already highly capable across today's accessible open-source models.
 

Conclusion and Next Steps

As a next step, we plan to extend this work by running the benchmark on the full datasets, leveraging more powerful infrastructure provided by OVHcloud. This upgraded environment will allow us to obtain more comprehensive and statistically reliable results, while reflecting the same type of production setup used to deploy our OpenRAG solutions.

Beyond evaluation, we see this benchmarking framework as a foundational building block for future developments. It provides a structured and reproducible environment to iteratively train, fine-tune, and validate Vietnamese language models, ensuring continuous progress driven by measurable performance gains. In this sense, benchmarking is not just an assessment tool; it becomes a core component of the model development lifecycle.