Introduction
This post details the setup and deployment of a local Large Language Model (LLM) serving solution on Kubernetes using Ollama and Open WebUI. The entire infrastructure is managed via GitOps principles using ArgoCD, ensuring declarative, version-controlled deployments.
Project Overview
The llama project provides a robust and scalable environment for running LLMs, featuring:
- Ollama: A powerful framework for running and managing various open-source LLMs locally.
- Open WebUI: A user-friendly web interface for interacting with Ollama, making it easy to manage models and chat with them.
- Kubernetes: Orchestrates the deployment, scaling, and management of the application components.
- ArgoCD: Implements GitOps by continuously synchronizing the desired state defined in Git repositories with the live state of the Kubernetes clusters.
Architecture and Components
The deployment consists of several key Kubernetes resources, all defined as YAML manifests and managed through ArgoCD:
1. Namespace
A dedicated llama namespace is created to logically isolate the application’s resources within the Kubernetes cluster.
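A minimal manifest for this namespace might look like:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llama
```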
2. Storage
- llama-storage-class: A StorageClass configured for NFS, providing persistent storage for Ollama models and Open WebUI data. This allows models to persist across pod restarts and ensures data integrity. The NFS server address (<ip address>) and export path (<nfs share>) are specified for dynamic provisioning.
- webui-pv and webui-pvc: A PersistentVolume and PersistentVolumeClaim defined for the Open WebUI, ensuring its data (e.g., user settings, chat history) is retained.
- ollama-volume (via StatefulSet): Ollama uses a volumeClaimTemplate within its StatefulSet to provision persistent storage for its models, utilizing the <storage Class> storage class.
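As a rough sketch of these resources, a StorageClass and claim might look like the following. This assumes an NFS CSI driver as the provisioner (the post does not name one), and the storage size is illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: llama-storage-class
provisioner: nfs.csi.k8s.io    # assumed provisioner; yours may differ
parameters:
  server: <ip address>         # NFS server address
  share: <nfs share>           # exported path
reclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: webui-pvc
  namespace: llama
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: llama-storage-class
  resources:
    requests:
      storage: 2Gi             # illustrative size, not taken from the project
```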
3. Ollama Deployment
ollama-statefulset: Deploys the Ollama server as a StatefulSet, ensuring stable network identities and ordered, graceful deployment and scaling.
- Image: ollama/ollama:latest
- Resource Limits: Configured with significant CPU (8000m), memory (12Gi), and GPU (nvidia.com/gpu: 1) requests and limits, enabling hardware-accelerated LLM inference.
- Volume Mount: Mounts ollama-volume at /root/.ollama for model storage.

ollama-service: A ClusterIP Service exposing Ollama on port 11434 for internal cluster communication.
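A condensed sketch of the StatefulSet, assembled from the details above (selector labels, replica count, and storage size are assumptions, as is reuse of the llama-storage-class defined earlier):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama-statefulset
  namespace: llama
spec:
  serviceName: ollama-service
  replicas: 1                       # assumed; not stated in the post
  selector:
    matchLabels:
      app: ollama                   # hypothetical label
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              cpu: "8000m"
              memory: 12Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: ollama-volume
              mountPath: /root/.ollama
  volumeClaimTemplates:
    - metadata:
        name: ollama-volume
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: llama-storage-class   # assumed class name
        resources:
          requests:
            storage: 50Gi           # illustrative; size to your model cache
```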
4. Open WebUI Deployment
webui-deployment: Deploys the Open WebUI as a standard Deployment.
- Image: ghcr.io/open-webui/open-webui:main
- Environment Variable: OLLAMA_BASE_URL is set to http://ollama-service.llama.svc.cluster.local:11434, linking it to the Ollama service within the cluster.
- Volume Mount: Mounts webui-volume at /app/backend/data for persistent application data.

llama-service: A ClusterIP Service exposing Open WebUI on port 8080 for internal cluster communication.
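The Deployment can be sketched as follows; labels and replica count are assumptions, and the volume is assumed to bind the webui-pvc claim described in the storage section:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webui-deployment
  namespace: llama
spec:
  replicas: 1                       # assumed
  selector:
    matchLabels:
      app: open-webui               # hypothetical label
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          env:
            - name: OLLAMA_BASE_URL
              value: http://ollama-service.llama.svc.cluster.local:11434
          volumeMounts:
            - name: webui-volume
              mountPath: /app/backend/data
      volumes:
        - name: webui-volume
          persistentVolumeClaim:
            claimName: webui-pvc    # assumed to reference the PVC above
```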
5. External Access
llama-ingress: An Nginx Ingress resource that provides external HTTP/HTTPS access to the Open WebUI.
- Host: <domain>
- TLS: Configured with cert-manager for automatic certificate issuance (letsencrypt-stage) and uses a secret (<cert>) for TLS termination.
- Annotations: Includes annotations for session affinity (nginx.ingress.kubernetes.io/affinity: "cookie") and a large proxy body size (4096m), which matters for uploading models and handling long LLM interactions.
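Putting those pieces together, the Ingress might be sketched like this (the cluster-issuer annotation name assumes a cert-manager ClusterIssuer; path and ingress class are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llama-ingress
  namespace: llama
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-stage
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/proxy-body-size: 4096m
spec:
  ingressClassName: nginx
  rules:
    - host: <domain>
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llama-service
                port:
                  number: 8080
  tls:
    - hosts:
        - <domain>
      secretName: <cert>
```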
GitOps with ArgoCD
The entire application lifecycle is managed by ArgoCD. The basic-application.yaml, helm-application.yaml, and kustomize-application.yaml files demonstrate different approaches (raw manifests, Helm, Kustomize) for defining ArgoCD Applications that point to the llama project’s Git repository. ArgoCD automatically synchronizes the state of these applications with the Kubernetes cluster, enabling automated deployments, rollbacks, and drift detection.
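A minimal ArgoCD Application along the lines of basic-application.yaml could look like this; the repository URL and manifest path are hypothetical placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llama
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/llama.git   # hypothetical repo URL
    targetRevision: main
    path: manifests                                 # assumed manifest path
  destination:
    server: https://kubernetes.default.svc
    namespace: llama
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert manual drift in the cluster
```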
Conclusion
This llama project provides a comprehensive setup for deploying LLMs using Ollama and Open WebUI on Kubernetes; swapping the letsencrypt-stage issuer for a production issuer is the main step remaining before a true production rollout. By leveraging GitOps with ArgoCD, it ensures maintainability, scalability, and robust management of the LLM infrastructure.