    GitOps
    ArgoCD
    Kubernetes
    DevOps
    SRE
    Production

    When GitOps Meets Emergency Fixes: ArgoCD Operational Lessons

    October 20, 2025
    9 min read
GitOps sounds pristine on paper. You define your entire infrastructure and deployments in Git, your pipelines handle the rest, and tools like ArgoCD keep everything in your cluster aligned to the version-controlled truth. It's the kind of DevOps dream that gets keynote time at every cloud-native conference.

But real-world infrastructure isn't a keynote. It's a graveyard of trade-offs, late-night alerts, and fire drills. And when GitOps breaks bad — usually at 2AM — it's not the senior engineer giving the talk at KubeCon who's waking up. It's the junior SRE sweating over kubectl edit, hoping their changes stick before ArgoCD notices.

This isn't theoretical. It's painfully, hilariously real — and the internet has stories.

## GitOps vs. Prod Fires: Who Wins?

One of the more common frustrations with GitOps workflows is that they're rigid by design. That's the point — we want to eliminate manual intervention, reduce drift, and ensure repeatability. But sometimes, repeatability is the enemy.

Imagine this: production is on fire. The pipeline is slow. Approval queues are jammed. And ArgoCD, the loyal enforcer of your declared state, keeps reverting your emergency fixes because they weren't done "the GitOps way."

That's what happened to one junior SRE who got paged in the middle of the night. Faced with a deployment issue, they used kubectl edit to try to patch things up — only to watch ArgoCD reset their changes every few minutes. The kicker? The only person with access to the ArgoCD platform was out. The result? Eight hours of downtime because nobody could stop the automation from undoing the emergency work.

It's not just one-off stories. This happens more often than most teams want to admit.

## Drift Detection: Savior or Saboteur?

The idea behind drift detection is noble: any changes made outside of Git should be flagged or reverted. But in emergency scenarios, this becomes a double-edged sword. As one engineer put it, "Sometimes you just gotta put the fire out." Yet to do that, many teams have to either:

- Temporarily suspend drift detection, fix the issue, and then re-enable it after the PR gets merged (if it ever does), or
- Hack their way into the system with breakglass permissions (if they exist at all).

And even then, you're racing the ArgoCD reconciliation loop. In setups with self-heal enabled, it'll undo your fix within seconds. In others, you might have a three-minute window before it reverts your cluster back to broken.

## The ArgoCD Catch-22

Let's talk about ArgoCD, because it shows up in almost every one of these war stories. It's a solid tool. It does what it promises. But it doesn't care that your approval chain is asleep or that your CI pipeline takes forever. It will enforce the last declared state, no matter how out of date it is.

This has led to some... creative solutions:

- Turning off auto-sync by editing the Application object manually (kubectl edit application <your-app> -n argocd)
- Redirecting ArgoCD to a temporary PR branch to bypass the slow merge process
- Just deleting the ArgoCD server pod entirely, hoping that when it comes back, it's learned its lesson (spoiler: it hasn't)

These aren't best practices. They're desperation tactics (safer equivalents are sketched below). But when prod is down, elegance takes a backseat to uptime.
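If you do have ArgoCD access at 2AM, there are less destructive versions of those same moves. Here's a minimal sketch using the `argocd` CLI, assuming you're already logged in and your Application is named `my-app` in the `argocd` namespace; the app name and branch are placeholders.

```bash
# Pause automated sync (self-heal goes with it) so a manual hotfix isn't reverted.
argocd app set my-app --sync-policy none

# Equivalent without the CLI: drop the automated sync policy from the Application object.
# A JSON merge patch with null removes the field entirely.
kubectl -n argocd patch application my-app \
  --type merge -p '{"spec": {"syncPolicy": {"automated": null}}}'

# Point the app at a hotfix branch and sync it once, manually,
# instead of waiting on a slow merge to main.
argocd app set my-app --revision hotfix/rollback-bad-release
argocd app sync my-app

# Once the real fix lands in Git, restore the original branch and whatever
# sync policy you ran before (automated with self-heal shown here).
argocd app set my-app --revision main
argocd app set my-app --sync-policy automated --self-heal
```

None of this is "the GitOps way" either, but pausing sync deliberately and visibly beats racing the reconciliation loop with kubectl edit, and it's exactly the kind of step worth writing into an emergency playbook.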
## Why Do Juniors Have Prod Access?

One of the most hotly debated aspects of this mess is access control. Why does a junior SRE have production permissions in the first place? Is this a badge of trust, or a sign that your on-call rotation needs serious restructuring? Opinions vary.

Some believe juniors should never be first responders for critical incidents. Others argue that incident response is where you really learn — provided there's a senior available to shadow or lead.

The problem isn't always the access — it's the lack of support structure. Giving a junior SRE keys to prod without guardrails, guidance, or a clear breakglass process isn't empowering — it's negligent.

## What a Good Breakglass Setup Actually Looks Like

Breakglass access isn't a controversial idea. It's supposed to exist for situations exactly like this. But it has to be designed right:

- **Auditable:** Every use is logged and reviewed.
- **Temporary:** Access expires automatically.
- **Justified:** Every use must come with a reason — and often, a postmortem.
- **Accessible:** It shouldn't require waking five people or waiting an hour to use.

If your "emergency access" takes longer to unlock than the PR pipeline, it's not a solution — it's a trap.

One team described using AWS IAM roles where you can assume a special "breakglass" role via SSO, but only after jumping through an extra confirmation step (a minimal sketch of that flow is included at the end of this post). It's quick, it's visible, and it forces engineers to think twice before making changes — without blocking them when minutes matter.

## Is GitOps Actually the Problem?

Here's where it gets interesting. A lot of these issues don't come from GitOps itself, but from bad implementation.

If your GitOps flow is too slow to be useful during an incident, that's a failure of the pipeline design — not the paradigm. If you're relying on a single engineer for ArgoCD access, that's a people/process problem, not a tool limitation. And if engineers feel they have to break process just to fix things quickly, it probably means your process wasn't built for real-world chaos in the first place.

## Lessons From the Front Lines

The engineers in these situations aren't clueless — they're resourceful. They know that saving the system sometimes means bending the rules. But they're also frustrated that they have to. Here's what teams could actually do to reduce GitOps pain in production:

- **Document the emergency playbook.** Don't assume everyone knows how to pause ArgoCD or reroute a sync. Write it down.
- **Create fast paths for prod hotfixes.** If your PR takes hours to merge, you don't have continuous deployment — you have continuous frustration.
- **Set up webhooks in ArgoCD.** Stop relying on polling every 3 minutes. Push mode is faster, less confusing, and more efficient.
- **Rotate access.** Don't bottleneck critical permissions with one person. If you need someone to edit Argo at 2AM, they shouldn't be unreachable.

And most importantly: **Have postmortems.** If something breaks bad, make sure the next 2AM responder has more than hope to work with.

## The Bottom Line

GitOps isn't broken. But your team might be.

Automation is only as good as the humans behind it. And when the humans are tired, under-trained, or unsupported, the best tools in the world can't save you.

So yes, ArgoCD will auto-heal. But it won't heal your broken processes, your lack of documentation, or the absence of a support net for your most junior engineers.

Fix that, and GitOps doesn't break bad — it just works.
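To close, here's the breakglass flow referenced above, sketched with a plain `aws sts assume-role` call. The role ARN, account ID, session name, and one-hour duration are all placeholders, and a real setup might reach the role through SSO instead; the shape is what matters: explicit, time-boxed access that leaves an audit trail.

```bash
# Assume a dedicated breakglass role; the session name makes the CloudTrail entry self-describing.
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/prod-breakglass \
  --role-session-name incident-2025-10-20-argocd \
  --duration-seconds 3600

# For the "extra confirmation step", require MFA in the role's trust policy
# and pass --serial-number / --token-code on this same call.

# The response contains temporary credentials (AccessKeyId, SecretAccessKey, SessionToken).
# Export them as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN,
# do the emergency work, and let them expire on their own.
```

Pairing every use of the role with an automatic alert and a postmortem template is one way to cover the auditable, temporary, justified, and accessible boxes from the checklist above.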