In 2026, self-hosting a large language model is no longer a lab curiosity. It has become a mature, performant and economically defensible option for a growing share of the projects we scope. Open-weight models like Mistral Large 2, Llama 3.3 70B, Qwen 2.5 72B or DeepSeek V3 now reach 90 to 95% of the performance of GPT-4 Turbo or Claude 3.7 Sonnet on most standard use cases: extraction, classification, summarization, assisted generation, RAG. This quality gap, which was a deal-breaker in 2023, has narrowed to the point that the question is no longer "is it good enough?" but "what's the right trade-off for our business?"

This article speaks to CTOs, CIOs and lead engineers seriously evaluating the self-hosted LLM option. We detail the six structuring decisions: sovereignty criteria, model choice, GPU sizing, serving stack, security and observability, FinOps. We rely on what we concretely see on our projects, without buying into the ambient marketing narrative. Sovereignty doesn't mean "self-host everything at all costs": the question is where to draw the line, and with what engineering bar.

Why this question is back with force in 2026

Three forces converge. First, regulation is tightening: GDPR for any personal data, HDS for the health sector, NIS2 for essential operators, the AI Act for high-risk systems, plus specific contractual requirements that large accounts now impose on their providers. Many of these clauses explicitly prohibit transfer to a non-European AI API or require full traceability impossible to guarantee on a third-party API.

Second, the open-weight ecosystem has caught up. Mistral Large 2 (123B parameters), Llama 3.3 70B, Qwen 2.5 72B and DeepSeek V3 (671B in MoE, ~37B active) are now industrial-grade models. Paired with servers like vLLM or TensorRT-LLM and NVIDIA H100, A100 or L40S — or even AMD MI300X — hardware, they deliver real throughputs of 30 to 80 tokens per second per request, with dynamic batching that absorbs hundreds of concurrent users on a single properly sized server.

1. Define sovereignty criteria before everything else

The most common mistake is to pick a model first and then dress the project in a sovereignty narrative. The rigorous approach is the reverse: first qualify data sensitivity, regulatory obligations and contractual commitments, then unroll the technical implications. Identifiable health data mandates HDS hosting and rules out any non-EU API. A document classified by a defense client mandates total network segregation. An R&D trade secret mandates at minimum encryption in transit and at rest, and ideally infrastructure outside the LLM operator's hands.

In our audits, we systematically separate three levels: public or non-sensitive data (proprietary API acceptable), confidential internal data (European API with strict DPA or sovereign cloud self-host), sensitive or regulated data (self-host mandatory, ideally on-prem or SecNumCloud / HDS cloud). This grid objectifies the need and avoids the "all or nothing" that bogs down most AI committees.

2. Pick the right open-weight model

Model choice depends on four criteria: quality on your use cases, specific French quality, size (and therefore infra cost), and license. Mistral Large 2 remains our reference for demanding French-language cases: reasoning quality, instruction-following, natural generation. Llama 3.3 70B offers excellent quality/size ratio but its license (Acceptable Use Policy) remains restrictive for some uses and excludes very large platforms. Qwen 2.5 72B is surprisingly strong on code and math. DeepSeek V3, in MoE, offers quality close to GPT-4 with a paradoxically moderate serving cost thanks to 37B active parameters.

Our method: we never trust generic rankings like MMLU or Chatbot Arena. We build an internal benchmark of 50 to 200 prompts representative of the client's domain, and we measure faithfulness, format, tone, latency and cost on each candidate model. It's the only honest way to decide.

  • Mistral Large 2 (123B): excellent French, clear MRL license, 2-4×H100 deployment
  • Llama 3.3 70B: best quality/size ratio, watch out for the AUP
  • Qwen 2.5 72B: strong on code and reasoning, partial Apache 2.0 license
  • DeepSeek V3 (671B MoE): top-tier quality, reasonable serving cost despite size

3. Size the GPU infrastructure

Sizing isn't intuitive. Three variables matter: available VRAM, target latency (time to first token and tokens/second per user) and concurrency (simultaneous users). For an INT4-quantized Llama 3.3 70B, a single H100 80GB is enough for POC but quickly saturates beyond 5-10 concurrent users. In FP8, you need two H100s. For Mistral Large 2 in production, we typically start with 2 to 4×H100 or an 8×L40S node depending on the load profile.

On the infrastructure side, two options dominate: direct purchase (1×H100 80GB around €30-40k excl. tax, amortizable over 24-36 months) or GPU cloud rental at Scaleway, OVHcloud, Lambda Labs or RunPod, between €2.5 and €4 per hour depending on provider and commitment. European GPU cloud remains relevant when seeking to balance sovereignty and flexibility, provided you verify the provider's real jurisdiction and SecNumCloud status where applicable.

4. Pick the serving stack

Four options structure the market. vLLM has become the open-source reference: PagedAttention, continuous batching, OpenAI-compatible support, massive community. It's our default choice. TGI (Hugging Face text-generation-inference) remains very solid, especially integrated into the HF ecosystem, slightly less performant than vLLM on throughput but very mature in production. TensorRT-LLM from NVIDIA delivers the best raw performance on NVIDIA hardware, at the cost of much higher compilation and maintenance complexity — relevant only at large scale. NVIDIA NIM offers a managed on-prem approach with ready-to-use containerized microservices: interesting when the internal team lacks a senior MLOps profile, less flexible and more expensive long term.

For projects up to a few hundred concurrent users, vLLM behind a reverse proxy (Traefik, NGINX) with an OpenAI-compatible client on the application side is the simplest, most performant and most maintainable combination. That's what we deploy on most of our self-hosting projects.

5. Secure and observe the LLM in production

A self-hosted LLM is never just an API to expose. It's a critical component that potentially manipulates sensitive data and constitutes a new attack surface. We systematically enforce four lines of defense: strict network isolation (private VPC, no direct internet access), strong application authentication (signed JWTs, rotation), full logging of prompts and responses with controlled retention, and business guardrails (input filters, output validation, prompt-injection detection).

Technical observability matters just as much: GPU monitoring (utilization, memory, temperature), serving metrics (TTFT, tokens/sec, queue size), application traceability (Langfuse, OpenLLMetry, or homemade solution). Without this telemetry, you won't know when your infra saturates, when a model drifts, or where a quality regression comes from.

6. FinOps: model TCO honestly

The financial question doesn't boil down to GPU price. The real TCO of a self-hosted LLM includes hardware amortization or rental, electricity and cooling if on-prem, internal or external MLOps skills, model maintenance (updates every 3-6 months on open-weight), observability, security, and upgrade phases. On our projects, we always build a 24-month comparison between three scenarios: pure proprietary API, hybrid (API for peak, self-host for recurring), full self-host.

For reference: Mistral Large 2 on 1×H100 80GB in optimized vLLM serving delivers about 50 tokens/second per request and 1,500 to 2,500 tokens/second in aggregated throughput depending on batching. Over 30 full days, that represents a theoretical volume of several billion tokens, to compare against an equivalent API bill that runs into tens of thousands of euros per month at the same volumes.

The pitfalls that wreck self-hosting projects

Four pitfalls recur in our catch-up audits. First, underestimating observability: without telemetry, the team flies blind and any outage becomes a disaster. Second, neglecting network security: too many POCs go to production with a vLLM endpoint exposed in the clear. Third, forgetting the recurring cost of model updates: an open-weight LLM from 2026 will be obsolete in 2027, the upgrade cadence must be planned from scoping. Fourth, not training the team: a self-hosted LLM isn't operated like a classic microservice, MLOps and ML engineering skills are essential.

To these technical pitfalls add a strategic one: confusing sovereignty with performance. Self-hosting doesn't make a bad use case good. If the business need is poorly scoped, if RAG is neglected, if evaluation is absent, the project will fail with or without sovereignty. Self-hosting is a means, not an end.

What's next?

If you're seriously evaluating self-hosting an LLM, the first step is to honestly qualify your sovereignty need, target volume and use-case criticality. That's precisely what we do on our scoping engagements at DevHighWay: decide between API, hybrid and self-host with numbers, not slogans.

  • Get in touch for a sovereignty scoping engagement (2-day workshop, decision-grade deliverable)
  • Our audit also covers the visibility dimension of your AI platform
  • Check our pricing for plans including LLM managed services

Self-hosting an LLM is today a credible option, sometimes the only one, for organizations that need to keep control over their data. But it's a serious engineering project to be treated as such: clear objectives, benchmarked model, sized infrastructure, security and observability tight, FinOps documented. That's exactly the posture we adopt on our projects, and the one that durably separates lasting deployments from POCs that die at the first production incident.