Project

Kubernetes LLM Inference Platform

A production-grade, GitOps-managed platform for self-hosted LLM inference at scale: GPU substrate, vLLM/KServe/llm-d serving, inference-aware routing, tenant budgets, observability, and developer AI workflows.


Self-hosting a model is easy to demo and hard to operate. The missing layer is everything around the model server: who can call it, which model they can see, how much they can spend, where the request should route, which signal should scale capacity, how secrets enter the cluster, and how you know the system is healthy.

This project turns those decisions into a runnable Kubernetes stack. It is not a chatbot app. The model server is one workload inside the platform, and scaling from a single GPU to multi-GPU nodes is a configuration change rather than a rewrite.

Figure: the platform as a layered stack of experience, tenant edge (keys, budgets, guardrails, SSO), inference-aware routing, serving, GPU scheduling, and cloud substrate, on an Argo CD, External Secrets, and Prometheus foundation.

Core proof

GPU Kubernetes on GKE, raw vLLM serving, Prometheus/Grafana/DCGM telemetry, and benchmark data for TTFT, ITL, throughput, queue depth, and KV-cache pressure.

Portability proof

OpenTofu owns the substrate; Argo CD owns the in-cluster platform. GKE consumes managed GPU plumbing; Hetzner validates the self-managed GPU Operator path.

Platform proof

LiteLLM virtual keys and budgets, GIE inference-aware routing, KEDA queue-depth autoscaling, External Secrets, and three serving layers validated on multi-GPU: raw vLLM, KServe, and llm-d (disaggregated prefill/decode with KV-aware routing).

Workflow proof

Open WebUI, Tabby, n8n, and an MCP tool gateway sit above the inference platform instead of turning the repo into an AI homelab bundle.

Governance & cost

SSO via Dex + oauth2-proxy, PII and prompt-injection guardrails at the tenant edge, Kyverno and default-deny NetworkPolicy enforcement, and OpenCost for per-tenant cost allocation.

The request path is layered on purpose:

Client / developer tool
  -> LiteLLM tenant edge
  -> Gateway API + GIE routing
  -> vLLM / KServe serving layer
  -> GPU substrate

The important decision is not the component list. It is the ownership boundary: LiteLLM owns tenant economics, GIE owns endpoint selection, vLLM owns token serving, KEDA owns queue-depth autoscaling, OpenTofu owns the cloud substrate, Argo CD owns the in-cluster platform, and External Secrets owns runtime secret delivery.
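The split between the tenant edge and the router can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Tenant` and `Endpoint` types, the function names, and the selection heuristic (fewest waiting requests, then lowest KV-cache usage) are assumptions made for illustration, not the LiteLLM or GIE APIs and not their actual scoring logic. The point is only that the edge decides whether a request may proceed, while the router decides where it goes.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    budget_usd: float
    spent_usd: float
    allowed_models: frozenset


@dataclass
class Endpoint:
    name: str
    model: str
    waiting_requests: int
    kv_cache_usage: float  # fraction between 0.0 and 1.0


class ModelNotAllowed(Exception):
    pass


class BudgetExceeded(Exception):
    pass


def tenant_edge_check(tenant, model, estimated_cost_usd):
    """Tenant-edge concern: model visibility and spend, nothing about routing."""
    if model not in tenant.allowed_models:
        raise ModelNotAllowed(f"{tenant.name} cannot see {model}")
    if tenant.spent_usd + estimated_cost_usd > tenant.budget_usd:
        raise BudgetExceeded(f"{tenant.name} would exceed its budget")


def pick_endpoint(endpoints, model):
    """Routing concern: choose a replica, nothing about who is calling.

    Assumed heuristic: fewest waiting requests, ties broken by KV-cache usage.
    """
    candidates = [e for e in endpoints if e.model == model]
    if not candidates:
        raise LookupError(f"no endpoint serves {model}")
    return min(candidates, key=lambda e: (e.waiting_requests, e.kv_cache_usage))


def handle_request(tenant, model, estimated_cost_usd, endpoints):
    """Edge check first, then endpoint selection; each layer owns one decision."""
    tenant_edge_check(tenant, model, estimated_cost_usd)
    return pick_endpoint(endpoints, model).name
```

Keeping the two functions separate mirrors the ownership boundary: budget logic can change without touching endpoint selection, and the reverse.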

LiteLLM owns economics while GIE owns inference-aware routing.

On one NVIDIA L4 serving Qwen2.5-0.5B-Instruct with vLLM:

Signal                           Observed result
GPU utilization                  100%
KV-cache usage                   1.1%
Waiting requests                 0
Throughput at concurrency 32     3,773 output tok/s
TTFT p50, concurrency 1 to 32    27 ms to 135 ms

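The lesson: GPU utilization alone was the wrong scaling signal. It read 100% while no requests were waiting and the KV cache was nearly empty, so it could not distinguish a busy GPU from an overloaded one. Queue depth carried the useful backpressure signal.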
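To make the difference concrete, here is a small Python sketch that applies the replica arithmetic documented for the Kubernetes Horizontal Pod Autoscaler (which KEDA drives) to the observed values. The target values (70% utilization, 8 waiting requests per replica) and the replica bounds are assumptions chosen for illustration, not settings taken from this project.

```python
import math


def clamp(value, low, high):
    return max(low, min(high, value))


def replicas_from_utilization(current_replicas, current_util_pct, target_util_pct,
                              min_replicas=1, max_replicas=8):
    """HPA-style ratio: ceil(currentReplicas * currentMetric / targetMetric)."""
    desired = math.ceil(current_replicas * current_util_pct / target_util_pct)
    return clamp(desired, min_replicas, max_replicas)


def replicas_from_queue_depth(total_waiting, target_waiting_per_replica,
                              min_replicas=1, max_replicas=8):
    """Average-value target: ceil(totalMetric / targetPerReplica)."""
    desired = math.ceil(total_waiting / target_waiting_per_replica)
    return clamp(desired, min_replicas, max_replicas)


# Observed on one L4: GPU utilization 100%, waiting requests 0.
# Targets below are assumed, not measured.
by_utilization = replicas_from_utilization(1, 100.0, 70.0)
by_queue_depth = replicas_from_queue_depth(0, 8)

print("utilization-based replicas:", by_utilization)
print("queue-depth-based replicas:", by_queue_depth)
```

Under these assumed targets, the utilization signal asks for a second replica even though nothing is queued, while the queue-depth signal stays at the minimum until requests actually start waiting.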

These L4 numbers are the cheapest slice, not the ceiling. The same stack runs across multi-GPU nodes with zero-downtime rolling updates; adding capacity is a matter of replica count and GPU budget, not rework.

Highly available by design: zero-downtime rolling updates across multi-GPU nodes, SSO and policy-enforced tenancy, per-tenant budgets and cost attribution, and disaggregated prefill/decode serving for larger models.