Introduction
This post details the setup and deployment of a local Large Language Model (LLM) serving solution on Kubernetes using Ollama and Open WebUI. The entire infrastructure is managed via GitOps principles using ArgoCD, ensuring declarative, version-controlled deployments.
Project Overview
The llama project provides a robust and scalable environment for running LLMs, featuring:
- Ollama: A powerful framework for running and managing various open-source LLMs locally.
- Open WebUI: A user-friendly web interface for interacting with Ollama, making it easy to manage models and chat with them.
- Kubernetes: Orchestrates the deployment, scaling, and management of the application components.
- ArgoCD: Implements GitOps by continuously synchronizing the desired state defined in Git repositories with the live state of the Kubernetes clusters.
Architecture and Components
The deployment consists of several key Kubernetes resources, all defined as YAML manifests and managed through ArgoCD:
1. Namespace
A dedicated llama namespace is created to logically isolate the application’s resources within the Kubernetes cluster.
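A minimal manifest for this namespace might look like:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llama
```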
2. Storage
- llama-storage-class: A StorageClass configured for NFS, providing persistent storage for Ollama models and Open WebUI data. This allows models to persist across pod restarts and ensures data integrity. The NFS server address (<ip address>) and export path (<nfs share>) are specified for dynamic provisioning.
- webui-pv and webui-pvc: A PersistentVolume and PersistentVolumeClaim defined for the Open WebUI, ensuring its data (e.g., user settings, chat history) is retained.
- ollama-volume (via StatefulSet): Ollama uses a volumeClaimTemplate within its StatefulSet to provision persistent storage for its models, utilizing the <storage Class> storage class.
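As a rough sketch of these resources, a StorageClass and claim might look like the following. This assumes an NFS CSI driver as the provisioner (the post does not name one), and the storage size is illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: llama-storage-class
provisioner: nfs.csi.k8s.io    # assumed provisioner; yours may differ
parameters:
  server: <ip address>         # NFS server address
  share: <nfs share>           # exported path
reclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: webui-pvc
  namespace: llama
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: llama-storage-class
  resources:
    requests:
      storage: 2Gi             # illustrative size, not taken from the project
```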
3. Ollama Deployment
ollama-statefulset: Deploys the Ollama server as a StatefulSet, ensuring stable network identities and ordered, graceful deployment and scaling.
- Image: ollama/ollama:latest
- Resource Limits: Configured with significant CPU (8000m), memory (12Gi), and GPU (nvidia.com/gpu: 1) requests and limits, enabling hardware-accelerated LLM inference.
- Volume Mount: Mounts ollama-volume at /root/.ollama for model storage.

ollama-service: A ClusterIP Service exposing Ollama on port 11434 for internal cluster communication.
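A condensed sketch of the StatefulSet, assembled from the details above (selector labels, replica count, and storage size are assumptions, as is reuse of the llama-storage-class defined earlier):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama-statefulset
  namespace: llama
spec:
  serviceName: ollama-service
  replicas: 1                       # assumed; not stated in the post
  selector:
    matchLabels:
      app: ollama                   # hypothetical label
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              cpu: "8000m"
              memory: 12Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: ollama-volume
              mountPath: /root/.ollama
  volumeClaimTemplates:
    - metadata:
        name: ollama-volume
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: llama-storage-class   # assumed class name
        resources:
          requests:
            storage: 50Gi           # illustrative; size to your model cache
```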
4. Open WebUI Deployment
webui-deployment: Deploys the Open WebUI as a standard Deployment.
- Image: ghcr.io/open-webui/open-webui:main
- Environment Variable: OLLAMA_BASE_URL is set to http://ollama-service.llama.svc.cluster.local:11434, linking it to the Ollama service within the cluster.
- Volume Mount: Mounts webui-volume at /app/backend/data for persistent application data.

llama-service: A ClusterIP Service exposing Open WebUI on port 8080 for internal cluster communication.
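The Deployment can be sketched as follows; labels and replica count are assumptions, and the volume is assumed to bind the webui-pvc claim described in the storage section:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webui-deployment
  namespace: llama
spec:
  replicas: 1                       # assumed
  selector:
    matchLabels:
      app: open-webui               # hypothetical label
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          env:
            - name: OLLAMA_BASE_URL
              value: http://ollama-service.llama.svc.cluster.local:11434
          volumeMounts:
            - name: webui-volume
              mountPath: /app/backend/data
      volumes:
        - name: webui-volume
          persistentVolumeClaim:
            claimName: webui-pvc    # assumed to reference the PVC above
```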
5. External Access
llama-ingress: An Nginx Ingress resource that provides external HTTP/HTTPS access to the Open WebUI.
- Host: <domain>
- TLS: Configured with cert-manager for automatic certificate issuance (letsencrypt-stage) and uses a secret (<cert>) for TLS termination.
- Annotations: Includes annotations for session affinity (nginx.ingress.kubernetes.io/affinity: "cookie") and a large proxy body size (4096m), which matters for uploading models and handling long LLM interactions.
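Putting those pieces together, the Ingress might be sketched like this (the cluster-issuer annotation name assumes a cert-manager ClusterIssuer; path and ingress class are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llama-ingress
  namespace: llama
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-stage
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/proxy-body-size: 4096m
spec:
  ingressClassName: nginx
  rules:
    - host: <domain>
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llama-service
                port:
                  number: 8080
  tls:
    - hosts:
        - <domain>
      secretName: <cert>
```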
GitOps with ArgoCD
The entire application lifecycle is managed by ArgoCD. The basic-application.yaml, helm-application.yaml, and kustomize-application.yaml files demonstrate different approaches (raw manifests, Helm, Kustomize) for defining ArgoCD Applications that point to the llama project’s Git repository. ArgoCD automatically synchronizes the state of these applications with the Kubernetes cluster, enabling automated deployments, rollbacks, and drift detection.
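A minimal ArgoCD Application along the lines of basic-application.yaml could look like this; the repository URL and manifest path are hypothetical placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llama
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/llama.git   # hypothetical repo URL
    targetRevision: main
    path: manifests                                 # assumed manifest path
  destination:
    server: https://kubernetes.default.svc
    namespace: llama
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert manual drift in the cluster
```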
Conclusion
This llama project provides a comprehensive setup for deploying LLMs using Ollama and Open WebUI on Kubernetes; swapping the letsencrypt-stage issuer for a production issuer is the main step remaining before a true production rollout. By leveraging GitOps with ArgoCD, it ensures maintainability, scalability, and robust management of the LLM infrastructure.