We Thought Kubernetes Would Save Us - The Production Failures No One Puts on the Conference Slides
February 18, 2026
6 min read
Someone asked a simple question:
What actually goes wrong in Kubernetes production?
Not best practices. Not theory. Not whiteboard architecture.
Real failures.
The answers weren’t polished. They were scars.
And if you read between the lines, you start to see a pattern:
Kubernetes doesn’t usually fail in dramatic, cinematic ways.
It fails in deeply human ways.
Let’s talk about what actually breaks.
---
# 1️⃣ The Control Plane Doesn’t Die Quietly — It Implodes
One engineer accidentally added ~60 machines to the **API server pool** instead of the node pool. etcd “got REALLY angry and collapsed under its own weight”.
That’s not a bug.
That’s operational blast radius.
But here’s the eerie part: workloads kept running.
They didn’t reschedule.
They didn’t heal.
They didn’t scale.
But they kept serving traffic in their last-known state.
That’s something most people don’t understand until they see it live:
Kubernetes tolerates losing its control plane in a weird, partial way.
The data plane keeps chugging even if the control plane is on fire.
Until it doesn’t.
Recovery involved manually restoring etcd from its own data directory and rejoining members.
That’s not a fun Tuesday.
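For the record, that recovery path roughly follows etcd's documented disaster-recovery procedure. A minimal sketch, assuming one member still has an intact data directory (hostnames and paths here are placeholders):

```bash
# On the member you trust, restart etcd as a brand-new single-member
# cluster built from its existing data directory.
etcd --name member-1 \
  --data-dir /var/lib/etcd \
  --force-new-cluster

# Make sure the survivor is actually healthy before going further.
etcdctl --endpoints=https://member-1:2379 endpoint health

# Re-add the other members one at a time. Each rejoining member must
# start with an empty data directory and --initial-cluster-state=existing.
etcdctl --endpoints=https://member-1:2379 \
  member add member-2 --peer-urls=https://member-2:2380
```

Practice it on a throwaway cluster before you need it on the real one.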
---
# 2️⃣ IP Address Exhaustion — The Silent Killer
“Subnet size for k8s created too small”.
That one sentence hides weeks of pain.
Clusters run out of IPs.
Pods fail to schedule.
Nodes scale but can’t attach ENIs.
Firewall rules need rewriting.
Networking teams get dragged into meetings.
Another comment bluntly said:
> Not using IPv6 is the first mistake.
Was it sarcastic? Maybe. But the pain is real.
Someone else mentioned that disabling warm ENI allocation in EKS freed thousands of IPs.
That’s the kind of thing you only learn after watching nodes fail to scale during an incident.
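On EKS with the AWS VPC CNI, that knob lives in environment variables on the aws-node DaemonSet. A rough sketch; the right targets depend on your pod churn, so treat the numbers as placeholders:

```bash
# Stop reserving a whole spare ENI's worth of IPs on every node and
# keep a small warm pool of individual IPs instead.
kubectl -n kube-system set env daemonset/aws-node \
  WARM_ENI_TARGET=0 \
  WARM_IP_TARGET=5 \
  MINIMUM_IP_TARGET=10

# Confirm the DaemonSet rolled out with the new settings.
kubectl -n kube-system rollout status daemonset/aws-node
```

The trade-off is slower IP attachment when a node suddenly needs many pods, which is exactly the behavior you want to test before an incident, not during one.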
IP math is not sexy.
Until production goes read-only.
---
# 3️⃣ DockerHub Rate Limits: The Self-Inflicted DDoS
“DockerHub rate limits are a major chicken and egg.”
Here’s what happens:
- You scale nodes.
- Nodes pull images.
- DockerHub throttles you.
- Pods fail.
- Autoscaler adds more nodes.
- They also fail.
You accidentally DDoS your own supply chain.
One engineer mentioned DDoSing their internal container registry during a node pool rollout.
That’s the thing about Kubernetes:
It amplifies mistakes.
Rollouts are multiplicative events.
If you haven’t implemented:
- Private registry
- Pull-through cache
- ECR/ACR/GCR mirror
- Harbor
- Image swapper
You’ll eventually learn the hard way.
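The cloud-side version of that fix can be small. ECR, for example, supports pull-through cache rules for Docker Hub; a sketch with placeholder ARNs, noting that a Docker Hub upstream needs credentials stored in Secrets Manager:

```bash
# Images referenced as <account>.dkr.ecr.<region>.amazonaws.com/docker-hub/...
# get pulled from Docker Hub once, then served from ECR afterwards.
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix docker-hub \
  --upstream-registry-url registry-1.docker.io \
  --credential-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:ecr-pullthroughcache/docker-hub
```

Whichever option you pick, the goal is the same: a node pool rollout should hit something you control, not an upstream with rate limits you don't.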
---
# 4️⃣ etcd Is Small… Until It Isn’t
etcd rarely shows up in architecture diagrams as the villain.
But it’s the heart.
If it’s slow:
- The control plane feels slow.
- Scheduling delays happen.
- API calls hang.
If it’s overloaded:
- You’re restoring from snapshots.
- You’re praying.
Most teams don’t monitor etcd deeply until after their first outage.
That’s the theme here.
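“Deeply” doesn't have to mean a week of dashboard work. A few signals cover most of it; a sketch, assuming you can reach the etcd endpoints and their Prometheus metrics (endpoint names are placeholders):

```bash
# Leader, DB size, and raft state for every member at a glance.
etcdctl \
  --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
  endpoint status --write-out=table

# Metrics worth alerting on, from etcd's /metrics endpoint:
#   etcd_disk_wal_fsync_duration_seconds       -> disk too slow
#   etcd_disk_backend_commit_duration_seconds  -> commits too slow
#   etcd_server_leader_changes_seen_total      -> flapping leadership
#   etcd_mvcc_db_total_size_in_bytes           -> creeping toward the DB quota
```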
---
# 5️⃣ Capacity Planning Isn’t Optional
Someone casually admitted:
> Bad capacity planning from our side.
That’s not rare.
Clusters are built for today’s scale.
Six months later:
- More namespaces
- More services
- More pods
- More IP usage
- More control plane load
Kubernetes doesn’t fail loudly when it approaches limits.
It degrades.
And degradation is harder to diagnose than explosions.
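You don't need a full capacity model to notice the drift; a periodic look at requested-versus-allocatable keeps the trend visible. A quick sketch (kubectl top assumes metrics-server is installed):

```bash
# Requested CPU and memory versus what each node can actually allocate.
kubectl describe nodes | grep -A 7 "Allocated resources"

# Live usage per node (requires metrics-server).
kubectl top nodes
```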
---
# 6️⃣ The Registry Lesson: Cache Everything
One response mentioned adding image caching to ECR after getting burned.
Another said that installing Harbor helped massively.
The pattern is clear:
External dependencies become internal single points of failure at scale.
Your cluster might be healthy.
Your registry might not be.
And Kubernetes doesn’t care whose fault it is.
It just reports “ImagePullBackOff.”
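At the very least, make that signal easy to see cluster-wide. A small sketch using kubectl and jq:

```bash
# List every pod currently stuck pulling an image, in any namespace.
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(any(.status.containerStatuses[]?;
      .state.waiting.reason == "ImagePullBackOff" or
      .state.waiting.reason == "ErrImagePull"))
  | "\(.metadata.namespace)/\(.metadata.name)"'
```

If that list ever jumps from zero to hundreds during a rollout, the problem is upstream of your cluster.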
---
# 7️⃣ IPv4 Assumptions
There’s a reason someone said:
> If you're deploying a cluster and you think “a /16 won’t be enough” then yes IPv6
Most teams assume:
- /24 is enough
- /22 is generous
- /16 is massive
Until:
- Pod-per-node density increases
- Secondary ENIs allocate
- Warm pools reserve IPs
- Sidecars double pod count
Networking is where “cloud-native” optimism meets physical limits.
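Before trusting any of those assumptions, check both sides of the math: how many IPs each node can consume, and how many the subnet has left. A sketch assuming AWS (subnet IDs are placeholders):

```bash
# Max pods per node, which on the VPC CNI roughly tracks IP consumption.
kubectl get nodes \
  -o custom-columns=NODE:.metadata.name,MAX_PODS:.status.allocatable.pods

# Free IPs remaining in each cluster subnet.
aws ec2 describe-subnets \
  --subnet-ids subnet-0abc subnet-0def \
  --query 'Subnets[].{Subnet:SubnetId,FreeIPs:AvailableIpAddressCount}' \
  --output table
```

Graph that second number over time and “subnet created too small” stops being a surprise.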
---
# 8️⃣ The Real Horror: It’s Usually Not Kubernetes
Here’s what stands out reading all this:
The horror stories aren’t about Kubernetes bugs.
They’re about:
- Misconfiguration
- Over-scaling control planes
- Under-sizing subnets
- Registry dependencies
- Capacity blind spots
Kubernetes mostly did what it was told.
The humans told it the wrong thing.
---
# 9️⃣ Observability Gaps
The original question asked about observability gaps.
Notice something?
Most failures described weren’t application issues.
They were:
- Networking constraints
- Control plane collapse
- Infrastructure bottlenecks
- Registry throttling
These don’t show up in your APM dashboard.
They show up in:
- etcd metrics
- Cloud subnet utilization
- ENI allocations
- Image pull latency
- API server saturation
If you’re only watching pod CPU and memory, you’re blind.
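Most of these signals are already exported; they're just not on the default dashboard. A sketch of where to start, assuming you can query the API server's metrics endpoint:

```bash
# API server saturation: in-flight requests and request latency,
# straight from the API server's own /metrics endpoint.
kubectl get --raw /metrics | grep -E \
  'apiserver_current_inflight_requests|apiserver_request_duration_seconds_count'

# The scheduler, controller-manager, and etcd expose metrics the same way.
# Subnet utilization, ENI counts, and registry throttling live in your
# cloud provider's metrics, outside the cluster entirely.
```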
---
# 10️⃣ The Big Lesson Nobody Likes
Kubernetes isn’t fragile.
It’s powerful.
But power multiplies blast radius.
A small mistake at scale becomes:
- 60 misconfigured API servers
- Thousands of IPs exhausted
- Registry meltdown
The system is deterministic.
The outcomes are not.
---
# What Actually Goes Wrong?
If you distill the thread down:
1. **Control plane overload**
2. **IP/subnet exhaustion**
3. **Registry bottlenecks**
4. **Capacity miscalculations**
5. **Networking assumptions**
6. **Overconfident rollouts**
Not YAML indentation.
Not container crashes.
Infrastructure math.
---
# The Quiet Truth
Kubernetes in production doesn’t usually fail because of some exotic zero-day exploit or scheduler bug.
It fails because:
- Someone assumed the cluster wouldn’t grow that fast.
- Someone underestimated IP math.
- Someone scaled node pools without caching images.
- Someone added machines to the wrong pool.
- Someone didn’t model worst-case rollouts.
The cluster didn’t betray them.
It amplified them.
---
And maybe that’s why one commenter called them “horror short stories for K8s folks”.
Because once you’ve seen one of these incidents live…
You never look at a simple `kubectl apply` the same way again.