Infrastructure · 9 min read · 14 May 2019

Kubernetes in Production: What Nobody Tells You

Running Kubernetes in a lab is manageable. Running it in production for a year is an education. Here are the things I wish someone had told me before we started.

Kubernetes · DevOps · Infrastructure · Production

The gap between "Kubernetes is working in our test environment" and "Kubernetes is running reliably in production" is larger than most teams anticipate. We learned this the hard way.

Our team had followed the tutorials, worked through the concepts, and had a cluster running. The basic operations were understood. Pods deployed. Services exposed. Ingress configured. Then we moved to production and the education began.

Networking was the first surprise. Kubernetes networking works by adding multiple layers of abstraction on top of the underlying network. Services get virtual IPs that do not correspond to any physical address. Pod IPs change when pods restart. DNS is provided by CoreDNS running inside the cluster. When something goes wrong with networking, debugging requires understanding all these layers simultaneously. Our first production incident was a CoreDNS issue that looked like a random subset of service calls failing. It took half a day to diagnose.
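When a DNS problem like that one is suspected, a quick way to test each layer is to resolve names from inside the cluster and then look at CoreDNS itself. A rough sketch of the commands we reach for (the debug image is one common choice from the Kubernetes docs, not something specific to our setup):

```shell
# Start a throwaway pod with DNS tooling, then resolve a known Service
# from inside it, e.g.: nslookup kubernetes.default.svc.cluster.local
kubectl run dnsutils --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 -- /bin/sh

# Check whether the CoreDNS pods themselves are healthy, and what they log.
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```

If resolution works from inside a pod but a service call still fails, the problem is usually in a layer above DNS (Service endpoints, network policy), which narrows the search considerably.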

Resource limits were the second lesson. Kubernetes lets you set CPU and memory requests (what the scheduler reserves for your pod) and limits (the maximum it may use). Getting these numbers right matters more than the documentation suggests. Set the memory limit too low and your pod gets OOM-killed mid-request when traffic spikes; set the CPU limit too low and it gets throttled instead. Set requests too high and your nodes sit with unused capacity while other pods wait to schedule. We spent the first month adjusting resource specifications based on real production traffic.
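The requests/limits distinction lives in a few lines of the pod spec. A minimal illustration (the workload name, image, and numbers are placeholders, not values from our cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server          # hypothetical workload
spec:
  containers:
  - name: app
    image: example/api:1.0  # placeholder image
    resources:
      requests:             # what the scheduler reserves on a node
        cpu: "250m"
        memory: "256Mi"
      limits:               # hard ceiling; exceeding the memory limit
        cpu: "500m"         # gets the container OOM-killed
        memory: "512Mi"
```

The only way we found to pick these numbers with any confidence was to observe actual usage in production and iterate.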

RBAC (Role-Based Access Control) in production is both essential and painful. In a test environment it is tempting to give everything wide permissions because debugging is easier. In production, you need proper RBAC, which means understanding what each component actually needs access to. The principle of least privilege is right, but implementing it requires careful thought and produces configurations that are long and hard to review.
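As a concrete example of what "least privilege" ends up looking like, here is a sketch of a namespace-scoped Role for a deploy pipeline that only needs to update Deployments (all names here are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: web            # scoped to a single namespace
  name: deployer
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "update", "patch"]   # no create, no delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: web
  name: deployer-binding
subjects:
- kind: ServiceAccount
  name: ci-deployer
  namespace: web
roleRef:
  kind: Role
  name: deployer
  apiGroup: rbac.authorization.k8s.io
```

Multiply this by every component and every namespace and you get the long, hard-to-review configuration the paragraph above describes.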

The upgrade process is underestimated. Kubernetes releases a new minor version every few months, and only the newest few minor versions keep receiving security patches, so staying current is not optional. Upgrading a production cluster is not trivial. You need to drain nodes, upgrade control plane components, upgrade worker nodes, and verify that your workloads still function correctly. With enough workloads, this is a significant operational event.
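The per-node part of that process looks roughly like the following; the node name is a placeholder, and the actual kubelet/OS upgrade step depends entirely on how the cluster was built:

```shell
kubectl cordon node-1      # mark the node unschedulable; nothing new lands here

# Evict the existing pods so they reschedule elsewhere.
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# ...upgrade the kubelet / node OS here (distribution-specific)...

kubectl uncordon node-1    # allow workloads back onto the node
```

Repeating this safely across every node, while watching that evicted workloads actually come back healthy, is what makes the upgrade a real operational event rather than a button press.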

What helped us most was investing in good monitoring and observability early. Kubernetes produces a lot of metrics; knowing which ones to watch, and what counts as normal versus abnormal, took time and deliberate effort. The clusters that ran well in production were the ones with good observability: problems were caught early because someone noticed a metric trending in the wrong direction.
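To make "someone noticed a metric trending the wrong way" less dependent on luck, alerts help. If you scrape kube-state-metrics with Prometheus (our stack; the rule name and threshold below are illustrative), a frequently-restarting pod can be surfaced like this:

```yaml
groups:
- name: kubernetes-workloads
  rules:
  - alert: PodRestartingFrequently
    # Restart counts come from kube-state-metrics.
    expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
    for: 10m          # must hold for 10 minutes before firing
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```

A handful of rules like this, tuned against what your cluster's normal actually looks like, catches the slow-burn problems before they become incidents.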
