Short answer: in 2026 the quality gap between top open source models and commercial flagships has narrowed enough that for most business use cases — especially RAG on your own documents — open models running on European infrastructure are not just sufficient, they are often more practical. The rest of this post shows when that holds, and when it is still worth reaching for a commercial model.
Why this discussion is even on the table
A year ago a conversation about open source in an enterprise context boiled down to “nice, but for production we use GPT”. Today the picture has changed for three reasons.
First — the quality of open models has stopped being an argument against them. GPT-oss-120B reaches scores close to OpenAI’s o4-mini on reasoning benchmarks, and runs on a single H100. Qwen3-235B is in the global LMArena top 5 with performance comparable to Gemini 2.5 Pro. Mistral Small 3.2 has an intelligence index of 15 on the Artificial Analysis ranking — clearly above the average for models of its class.
Second — a real choice of European infrastructure has emerged. Scaleway hosts all of these models in a sovereign cloud where prompts are not logged, data does not leave the EU, and GDPR compliance is structural rather than contractual. For companies in regulated sectors that is a fundamental shift.
Third — the AI Act starts applying to high-risk systems from August 2026. Using a US hyperscaler’s European region does not meet sovereignty requirements because of the extraterritorial reach of the US CLOUD Act. The topic has stopped being theoretical — it is now a legal compliance decision.
What these three open source models actually do
Mistral Small 3.2 — a European model under Apache 2.0, 24B parameters, 128k token context. It handles both text and images. In practice it is the workhorse model: fast, predictable, cheap. A good fit for chatbots, internal tooling, automations and, in particular, RAG on company documents. After the 3.2 update, instruction-following accuracy improved noticeably (from 82.75% to 84.78%) and the infinite-generation rate nearly halved (from 2.11% to 1.29%). It will not replace flagships in deep reasoning, but for typical business use cases the value-to-cost ratio is the strongest in the Mistral family.
GPT-oss-120B — OpenAI’s first significant open source release since GPT-2, published in August 2025 under Apache 2.0. It is a Mixture-of-Experts model: 117 billion parameters total, but only 5.1 billion active per token. Practical consequence — it fits on a single H100, deployment cost drops 3–5x compared to models that need clusters. Configurable reasoning levels (low, medium, high), tool calling, 128k context. In independent RAG tests by DataRobot the 20B variant with low “thinking effort” was consistently on the Pareto frontier — a smaller, cheaper model often beating a bigger and more expensive one. An important practical lesson: bigger does not mean better.
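The configurable reasoning level is a request-time setting, not a separate model. A minimal sketch of how that might look against an OpenAI-compatible endpoint — the model identifier, the "Reasoning: <level>" system-prompt convention (taken from OpenAI's gpt-oss model card) and the exact parameter names are assumptions to adapt to your serving stack:

```python
# Sketch: build a chat request for GPT-oss with a configurable reasoning level.
# Assumptions: an OpenAI-compatible endpoint, and a serving stack that honours
# a "Reasoning: <level>" hint in the system message (per the gpt-oss model
# card). The model name "gpt-oss-120b" is illustrative.

VALID_LEVELS = ("low", "medium", "high")

def build_request(question: str, context: str, level: str = "low") -> dict:
    """Return a request payload; low effort is often enough for RAG."""
    if level not in VALID_LEVELS:
        raise ValueError(f"reasoning level must be one of {VALID_LEVELS}")
    system = (
        f"Reasoning: {level}\n"
        "Answer strictly from the provided context and cite the passage used."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return {
        "model": "gpt-oss-120b",   # hypothetical identifier on your provider
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.2,        # keep RAG answers grounded, not creative
    }
```

The DataRobot result mentioned above is the reason `level="low"` is the default here: for retrieval-grounded answers, paying for extra "thinking" often buys nothing.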
Qwen3-235B — a Chinese model from Alibaba Cloud, top 5 in global LMArena. Strengths are multilingual quality (better Polish than American open models), low latency at high token efficiency, strong agentic performance. Available on Scaleway through Generative APIs. An important caveat: although the model is open source, the provenance itself can be a counter-argument in some industries — not everything is settled by a licence.
When open source really is enough
The practical answer for most enterprise RAG: yes, it is enough. Tasks where open models do very well:
- Question answering on company documents — policies, procedures, offers, contracts. The model does not need world knowledge, it needs to read the supplied context and cite it sensibly. That is exactly the task at which Mistral Small 3.2 and GPT-oss perform very well.
- Internal assistants for departments — HR, legal, sales, customer support. Specialisation through retrieval, not through model size.
- Classification, data extraction, summarisation — structured tasks where open source typically matches or beats commercial models at a fraction of the cost.
- External chatbots on websites — speed and cost matter, not subtlety of reasoning.
- Multi-step processes with tool calling — GPT-oss-120B was specifically tuned for agentic use cases.
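The first item on that list — answering from supplied context rather than world knowledge — hinges on the retrieval step. A deliberately minimal sketch of that step, using keyword overlap in place of the embeddings and vector store a production pipeline would use; all names and documents here are illustrative:

```python
# Minimal sketch of retrieval for "QA on company documents": score chunks by
# keyword overlap with the question and keep the best ones. Production systems
# use embeddings and a vector store; the scoring here is illustrative only.

def score(chunk: str, question: str) -> int:
    """Count question words that also appear in the chunk (case-insensitive)."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words)

def retrieve(chunks: list[str], question: str, top_k: int = 2) -> list[str]:
    """Return the top_k chunks most relevant to the question."""
    ranked = sorted(chunks, key=lambda c: score(c, question), reverse=True)
    return ranked[:top_k]

docs = [
    "Refunds are processed within 14 days of receiving the returned item.",
    "Our office is open Monday to Friday, 9:00 to 17:00.",
    "Warranty claims require the original proof of purchase.",
]
best = retrieve(docs, "How long do refunds take after a return?", top_k=1)
```

Whatever `retrieve` returns gets pasted into the prompt as context — which is why, as argued later in the post, retrieval quality matters more than which model reads the result.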
When commercial models still make sense
To be fair — open source is not always the best answer. Flagship models (GPT-5, Gemini 3, Claude Opus 4.7) still win in several scenarios.
- Complex production-grade coding tasks — Claude Opus and GPT-5 still have a clear edge in generated code quality. If you are building an agent that writes application code, do not skimp on the model.
- Very long documents with many dependencies — commercial models have larger context windows and are better at finding information deep inside a long context.
- High-quality creative writing — marketing copy, polishing text, stylistic nuance. Commercial models still have better “taste”.
- The hardest mathematical and analytical reasoning — for tasks that demand multi-step, precise reasoning the flagships are still ahead.
- Top-tier multimodality — analysing complex images, charts and diagrams is where Gemini and GPT-5 are more polished.
The argument that often gets lost — vendor independence
This is not an ideological or political argument. It is operational and financial.
First — cost control. OpenAI, Anthropic and Google can change pricing, limits or terms at will. They have done so many times. When your product depends on one vendor, you are one press release away from a margin crisis.
Second — continuity control. Models get retired, deprecated, modified. An application that worked great on GPT-4 may behave differently on GPT-5. With an open source model you have a frozen version that you control.
Third — jurisdiction control. For a company in finance, legal or the public sector, “where is the data physically located during inference” stops being academic. It is the auditor’s question.
Fourth — the ability to fine-tune and specialise. Open source models can be fine-tuned on your own data, embedded into an internal pipeline, even modified architecturally. Commercial models give you a fraction of that.
This does not mean cutting commercial models off entirely. The pragmatic approach is a multi-vendor architecture — by default a cheaper open source model on European infrastructure, with the hardest queries routed to a chosen flagship. The customer does not see the difference, but you can change the routing one day without rewriting the application.
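That routing layer can be a few lines of code sitting in front of both providers. A sketch under stated assumptions — the model names, the length threshold and the keyword heuristics are all illustrative, not a recommendation; real routers often use a classifier or confidence score instead:

```python
# Sketch of the multi-vendor routing idea: default to an open model on EU
# infrastructure, escalate only the hardest queries to a commercial flagship.
# Model names, threshold and keyword heuristics are illustrative assumptions.

OPEN_MODEL = "mistral-small-3.2"     # default: cheap, EU-hosted
FLAGSHIP = "commercial-flagship"     # fallback for reasoning-heavy work

HARD_HINTS = ("prove", "refactor", "derive", "write code", "debug")

def route(query: str) -> str:
    """Pick a model: escalate very long or reasoning-heavy queries."""
    q = query.lower()
    if len(query) > 2000 or any(hint in q for hint in HARD_HINTS):
        return FLAGSHIP
    return OPEN_MODEL

route("What does our leave policy say about carry-over days?")   # open model
route("Refactor this 800-line service and debug the race")       # flagship
```

Because the routing decision lives in your code rather than in a vendor contract, swapping either side out later is a one-line change — which is the whole point of the argument above.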
Practical recommendations
For companies starting with AI on their own data — start with Mistral Small 3.2 in a European cloud. Low cost, good RAG quality, GDPR compliance. If you outgrow it, GPT-oss-120B is the natural next step.
For companies deploying AI Assistants in organisations with compliance requirements — GPT-oss-120B on Scaleway or an on-premise deployment. Apache 2.0, data sovereignty, control over chain-of-thought.
For companies building multilingual products — Qwen3-235B is worth considering for its language quality, but factor in the reservations some regulated industries may have about its Chinese provenance.
For top-tier quality tasks — commercial models still have their place, but should be a deliberate choice rather than the default.
What a good model alone will not do
Picking a model is an important decision, but not the most important one. In the RAG projects we run at Web Amigos, 80% of answer quality comes from retrieval architecture, source document quality and prompt design — not from whether you use Mistral or GPT-5. A company with chaotic documentation will not fix it with the most expensive model on the market. A company with tidy data and a thoughtful pipeline will get great results from an open model.
Open models have matured to the point where, for most business use cases, they are the default choice rather than a compromise. The question is no longer “will open source cope” but “which open source and on which infrastructure”. That is good news — the era of being held hostage by a single vendor has just ended.
This post is part of a series on building AI Assistants independent from foreign providers. In the next post we show what a provider really has to satisfy for data sovereignty to be more than a declaration: Where is your AI Assistant’s data physically located? A practical guide to AI sovereignty in the EU.
