Ceph
Storage
API
Disaster Recovery
Proxmox
Debugging
How One Bad API Call Took Down an Entire Ceph Cluster
November 9, 2025
12 min read
It started with a single command — a simple curl request that was supposed to pull a few harmless stats. Instead, it brought an entire Ceph cluster to its knees.
A home lab admin, tinkering with Ceph's RESTful API, sent what should've been a routine call:
```bash
curl -k -X POST "https://USERNAME:API_KEY@HOSTNAME:PORT/request?wait=1" -d '{"prefix": "df", "detail": 1}'
```
That line doesn't look dangerous. It's the kind of one-liner anyone running Ceph has probably used or tested in some form — a quick peek at pool stats, something you could usually pull with `ceph df detail`. But this time, that tiny detail parameter — `"detail": 1` — triggered something far worse than a parsing error. Within seconds, the monitor services (ceph-mon) crashed across the entire cluster.
Every VM tied to the system froze. The Proxmox dashboard went red. Storage daemons screamed in silence. The cluster — the beating heart of a self-hosted infrastructure — was dead.
And all because of one malformed API request.
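For what it's worth, the CLI equivalent (`ceph df detail`) takes `detail` as a keyword, not a number. Judging from the command descriptions in Ceph's source, the REST body the monitors expect most likely passes the literal string "detail"; the following is a hedged sketch of that presumed-valid form, not a verified fix from the thread:

```bash
# Presumed-valid form (assumption): the monitor command table appears to define
# `detail` as a choice field whose only accepted value is the string "detail",
# which is what the CLI's `ceph df detail` sends under the hood.
curl -k -X POST "https://USERNAME:API_KEY@HOSTNAME:PORT/request?wait=1" \
  -H "Content-Type: application/json" \
  -d '{"prefix": "df", "detail": "detail"}'
```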
## The day the monitors died
At first, it looked like a minor hiccup. The admin, who goes by packetsar, noticed Ceph's monitors throwing errors. But instead of recovering, they crashed — and kept crashing. Logs filled with a hauntingly specific message:
```
ceph-mon[278661]: terminate called after throwing an instance of 'ceph::common::bad_cmd_get'
ceph-mon[278661]: what(): bad or missing field 'detail'
```
Anyone who's wrangled Ceph knows that when the monitors (MONs) fail, it's not a small problem. MONs keep track of the cluster map — they know which OSDs hold what data, where pools live, who's in quorum. When they're offline, the cluster effectively loses its memory.
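When the monitors are still reachable, their state is easy to check with standard CLI calls (assuming a working admin keyring); once they're all down, even these stop responding:

```bash
# Which monitors exist and who is currently in quorum
ceph mon stat

# Fuller view: quorum members, election epoch, and the current monmap
ceph quorum_status --format json-pretty

# Overall cluster health, including "mon down" warnings
ceph health detail
```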
But this wasn't just a single node crash. It cascaded across all monitors. Each one picked up the poisoned request from a shared database and followed it right off a cliff. Every attempt to restart just repeated the failure.
"Reboots, service restarts — nothing worked," wrote packetsar. "Ceph and Proxmox cluster are hard down and VMs have stopped at this point."
## The self-inflicted poison pill
The weirdest part? The command shouldn't have been that harmful. Ceph's API is supposed to validate input — to gracefully handle malformed JSON or unknown fields. Instead, the `"detail": 1` parameter got interpreted in a way that caused a fatal exception inside the MON process.
Essentially, Ceph ingested the bad request, choked on it, and then saved it, replaying the same invalid command each time the monitors tried to restart. A self-sustaining crash loop: the cluster remembered the bad command and kept feeding it back to itself.
"Looks like that request got put in a shared ceph-mon database and caused all the monitor services to crash," the admin wrote.
It wasn't just one bad command; it was persistent corruption. The poisoned state lived inside the monitor store, meaning even clean restarts couldn't shake it.
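The thread doesn't say exactly which key the request ended up under, but Ceph does ship an offline tool for inspecting a monitor's store. A hedged sketch of how one might look, assuming a classic package install (cephadm keeps the data dir under `/var/lib/ceph/<fsid>/mon.<name>/` instead):

```bash
# The store must not be open by ceph-mon while you inspect it.
systemctl stop ceph-mon@$(hostname -s)

MON_DIR=/var/lib/ceph/mon/ceph-$(hostname -s)

# Dump the keys the monitor is carrying in its store.db.
ceph-monstore-tool "$MON_DIR" dump-keys > /root/mon-keys.txt
```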
One comment summed it up perfectly:
*"Yet another example of bad input validation? Don't forget to file this as a bug towards Ceph."*
It's a simple but brutal truth: when you build distributed systems, input validation isn't optional. One unchecked field can bring everything down.
## "You can poison the whole cluster with a single REST call"
That's how packetsar described it later in the thread. It sounds hyperbolic — until you remember Ceph's architecture.
Ceph is built around consensus. The monitors use Paxos to keep their shared database identical across the quorum, so whatever one monitor commits, its peers replicate as if it were valid. If a malformed command makes its way into that database, it gets copied faithfully across nodes. The system trusts its peers; that's its strength and, in this case, its weakness.
When someone asked whether the same command worked fine via CLI, the admin confirmed: yes, `ceph df detail` ran perfectly from the command line. It was the REST API layer that failed to validate properly. That distinction is critical because it means the problem wasn't with Ceph's logic, but its API wrapper — a layer meant to make automation safer and simpler.
Instead, it became a single point of failure.
## Rebuilding from ashes
Once the scope of the failure sank in, the real question was: how do you bring a dead Ceph cluster back when your monitors won't even start?
There's no magic "undo" button for Ceph's monitor database. So packetsar had to go old-school, rebuilding it manually from the OSDs (Object Storage Daemons). It's a method outlined deep in Ceph's troubleshooting documentation, usually reserved for extreme corruption scenarios.
Here's the recovery process, simplified (a command-level sketch follows the list):
1. Shut down all OSDs across every host to stop further writes.
2. Pick one host to start with and rebuild a fresh monitor database using data from its local OSDs.
3. Rsync that new store.db to other hosts, one by one, rebuilding and merging as you go.
4. Once the database is complete, replace the production store.db on all monitors with the new one.
5. Bring up the MONs again and let them reach quorum.
6. Finally, rebuild the manager (mgr) daemons one at a time and reconfigure settings like the REST API module.
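For readers who want to see what those steps look like at the command level, here's a condensed sketch of the OSD-based rebuild described in Ceph's monitor troubleshooting docs. Hostnames, keyring paths, and monitor IDs below are placeholders, unit names differ on cephadm clusters, and exact flags vary between releases, so treat it as an outline rather than a runbook:

```bash
# 1. Stop all OSDs on every host so no writes land while the maps are rebuilt.
systemctl stop ceph-osd.target

# 2. On the first host, harvest cluster-map data from each local OSD
#    into a fresh monitor store.
ms=/root/mon-store
mkdir -p "$ms"
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path "$osd" --no-mon-config \
    --op update-mon-db --mon-store-path "$ms"
done

# 3. Copy the partial store to the next host and repeat step 2 there,
#    looping until every host's OSDs have been merged in.
rsync -avz "$ms/" root@next-host:"$ms/"

# 4. Rebuild a usable monitor store from the merged data.
ceph-monstore-tool "$ms" rebuild -- \
  --keyring /etc/ceph/ceph.client.admin.keyring \
  --mon-ids mon1 mon2 mon3

# 5. On each monitor, move the poisoned store aside and drop in the new one.
mon_dir=/var/lib/ceph/mon/ceph-$(hostname -s)
mv "$mon_dir/store.db" "$mon_dir/store.db.corrupted"
cp -a "$ms/store.db" "$mon_dir/store.db"
chown -R ceph:ceph "$mon_dir/store.db"

# 6. Start the monitors, wait for quorum, then bring OSDs and mgrs back.
systemctl start ceph-mon.target
ceph -s
systemctl start ceph-osd.target
```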
That's not something you do lightly. It's slow, nerve-wracking work — especially when your storage cluster is the foundation of your entire virtualization setup.
But it worked.
"I was able to get VMs back online about an hour after I started slowly working through the rebuild process," he wrote.
That's remarkable resilience — and a testament to how deeply open-source systems can be understood and repaired when you have full control and patience.
## Lessons from the meltdown
If there's a moral to this story, it's not "don't tinker with your cluster." Tinkering is how you learn. But distributed systems, especially ones like Ceph, demand a particular respect.
A few takeaways stand out from this debacle:
### 1. APIs aren't always safer
We tend to think APIs abstract away risk — that using structured calls instead of raw commands makes automation safer. But this incident shows the opposite can be true if the API isn't validating inputs properly.
Bad input through an API shouldn't take down the service it's meant to expose. That's development 101.
### 2. Persistence cuts both ways
Ceph's design makes it resilient — it keeps data consistent across nodes, even through crashes. But that persistence also makes it unforgiving. If it stores a bad state, that bad state becomes gospel until manually purged.
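One cheap mitigation, offered here as a sketch rather than a blessed procedure: keep a recent copy of a monitor's store before poking at the admin API, so there's a known-good state to fall back on. Paths and unit names assume a classic package install, and stopping one monitor is only safe while the rest still hold quorum:

```bash
MON=$(hostname -s)
systemctl stop ceph-mon@$MON

# Archive the store; a point-in-time safety net, not a substitute for real backups.
tar -C /var/lib/ceph/mon/ceph-$MON -czf /root/mon-store-$(date +%F).tgz store.db

systemctl start ceph-mon@$MON
```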
### 3. Testing in production isn't testing
It's tempting to test "just one small thing" in your live cluster, especially if you've done it a hundred times before. But a self-hosted lab isn't the same as a sandbox. One malformed command can turn an evening experiment into a 4 a.m. recovery session.
### 4. Documentation is survival
What saved this cluster wasn't luck — it was documentation. The ability to rebuild a monitor store from OSDs isn't common knowledge. It's the kind of process buried in Ceph's lower-level docs, the stuff only desperate admins end up reading. But it's also what brought the cluster back.
## The quiet danger of silent failures
This wasn't the first Ceph disaster story, and it won't be the last. In fact, others chimed in with eerily similar experiences — clusters refusing to come back after power outages, monitors corrupted beyond repair, or phantom configurations that refused to die.
One user wrote:
*"Managed to break Ceph by having an unexpected power outage… it never came back up and nothing I tried worked. Even starting from scratch kept bringing back old remnants. I sacked Ceph off and moved to ZFS with replication."*
ZFS isn't distributed storage — it trades scalability for simplicity and reliability. That's telling. When a system's recovery path is harder than a full migration, that's a warning sign.
Ceph is powerful, but it's not forgiving. It gives you incredible flexibility — object, block, and file storage in one platform — but it assumes you'll handle that power responsibly.
## "Public shaming might be the best medicine"
One commenter half-joked that if the bug was reproducible, "the Ceph devs should be ashamed." And honestly, they have a point.
This isn't just an obscure corner case. An authenticated REST call shouldn't be able to crash core services, period. Whether you're running a home lab or a data center, that kind of fragility undermines trust in the stack.
Open-source software thrives on transparency. Bugs happen — but ones like this highlight the need for better testing around API input validation, particularly for admin-level commands.
## Aftermath and reflection
The story has a happy ending — the cluster was revived, the VMs came back online, and the admin learned more about Ceph's internals in a few hours than most do in months.
But it also left a mark. "It seems crazy that you can poison the whole cluster with a single REST call," he said. And he's right.
Distributed systems are supposed to be resilient, but resilience doesn't mean invincibility. Sometimes, it means recoverability. And that's what this case demonstrates most clearly — not perfection, but the ability to claw your way back when everything goes wrong.
## The bigger picture
There's a broader theme here about the fragility of automation in modern infrastructure. We build layers upon layers — APIs on top of daemons, daemons on top of databases — all to make management simpler. But each layer also introduces new complexity and new ways to fail.
In cloud-native environments, that risk multiplies. Every service talks to another over APIs. Every command, every deployment script, every automation job carries the potential to trigger a cascade.
And when something like Ceph — one of the most respected open-source storage systems in the world — can be brought down by a single malformed JSON field, it's a reminder that complexity always comes at a cost.
## Epilogue: One line to rule them all
Somewhere in a terminal history sits that fateful command. It's small, unassuming — a line of text that fits in a tweet.
But behind it is a story about trust, fragility, and resilience in distributed systems. About how even the most sophisticated software can fall apart from a tiny mistake. And about how, with the right knowledge, persistence, and documentation, it can be brought back to life again.
Because sometimes, the difference between total data loss and recovery is just knowing which file to rsync next.