February 1, 2026
# What Actually Goes Wrong in Kubernetes Production?
## What started the discussion
Hey Kubernetes folks,
I’m curious to hear about real-world production experiences with Kubernetes.
For those running k8s in production:
- What security issues have you actually faced?
- What observability gaps caused the most trouble?
- What kinds of things have gone wrong in live environments?
I’m especially interested in practical failures — not just best practices.
Also, which open-source tools have helped you the most in solving those problems? (Security, logging, tracing, monitoring, policy enforcement, etc.)
Just trying to learn from people who’ve seen things break in production.
Thanks!
## What stood out in the comments
### Discussion point 1
There was a time when all of this was recorded at [https://k8s.af/](https://k8s.af/). I still laugh about it today because I've been through some of those failures myself.
### Discussion point 2
The subnet created for k8s was too small. We hit the limits and needed a new, larger subnet IP range, which required a whole lot of new firewall requests.
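A quick capacity check before requesting a subnet can catch this early. The sketch below is only an illustration: it assumes a flat network where every node and every pod takes an address from the same subnet (many CNI setups use a separate pod range instead), and the function names, the `reserved` default, and the example CIDRs are hypothetical, not from the thread.

```python
import ipaddress


def usable_hosts(cidr: str, reserved: int = 2) -> int:
    """Addresses left in a subnet after subtracting reserved ones.

    `reserved` defaults to 2 (network and broadcast address). Cloud
    providers often reserve more, so treat the default as an assumption.
    """
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - reserved, 0)


def fits(cidr: str, nodes: int, pods_per_node: int, reserved: int = 2) -> bool:
    """True if every node and every pod can get its own IP from the subnet.

    Assumes pods draw addresses from the same subnet as the nodes.
    """
    needed = nodes * (1 + pods_per_node)  # one IP per node plus one per pod
    return needed <= usable_hosts(cidr, reserved)
```

Under these assumptions, 10 nodes running 30 pods each need 310 addresses, so `fits("10.0.0.0/24", 10, 30)` is false (254 usable) while `fits("10.0.0.0/22", 10, 30)` is true (1022 usable). Sizing for expected growth up front is usually cheaper than a second round of firewall requests.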
### Discussion point 3
Accidentally added ~60 machines to the apiserver pool instead of the node pool, etcd got REALLY angry and collapsed under its own weight. That day, I learned two things:
- The workloads will largely continue to operate in their last-known state for a surprisingly long time if the control plane goes down. Nothing can recover or move, but they'll keep chugging along in place.
- If you shut down all but one member of the pre-change apiservers, you can hand etcd its own data directory as a backup/snapshot and it'll happily restore the cluster data _without_ etcd membership, then rejoin the other members that you want in the etcd cluster.
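The comment does not say exactly why etcd collapsed, but the quorum arithmetic gives a rough sense of how much the cluster changes when membership balloons. etcd needs a majority of members to agree before a write commits, so every added member raises the number of acknowledgements required. The sketch below assumes each of the ~60 machines joined as an etcd member (implied, not stated) and uses a hypothetical starting size of 3 members.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd (Raft) cluster with `members` voters."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


# Hypothetical sizes: 3 members before the mistake, 63 after it.
before, after = 3, 63
summary = {
    "quorum_before": quorum(before),
    "quorum_after": quorum(after),
}
```

With these assumed numbers, the quorum goes from 2 to 32, so writes need sixteen times as many acknowledgements and the cluster stops accepting writes as soon as fewer than 32 members are healthy. A single-member cluster has a quorum of 1, which fits the commenter's recovery path of restoring on one surviving member first and rejoining the others afterwards.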
### Discussion point 4
Docker Hub rate limits are a major chicken-and-egg problem.
### Discussion point 5
Windows and Kubernetes is one of the circles of hell Dante was talking about.
## Thread snapshot
- Original subreddit: r/kubernetes
- Original author: u/Apple_Cidar
- Reddit score: 121
- Comment count: 86
- Original thread: https://www.reddit.com/r/kubernetes/comments/1r7t6lv/what_actually_goes_wrong_in_kubernetes_production/