The DIY SAN Rabbit Hole: One Engineer Asked How Modern Storage Really Works, and the Answer Got Complicated Fast

There’s a certain kind of question that cuts through the usual enterprise storage noise. Not “which vendor is better?” Not “why is this license so expensive?” Not “why did the dashboard break after the firmware update?” This one was cleaner, sharper, and much more dangerous: how does a modern dual-controller SAN actually work under the hood?

A software engineer working at a cloud provider decided to build a small version of the thing vendors sell for serious money. The setup sounded familiar to anyone watching modern arrays evolve: two x86 platforms, direct 25Gb links, dual-ported NVMe drives, NVMe over TCP or RDMA, exposed namespaces, and a topology inspired by switchless HA systems like HPE Alletra MP B10000. The engineer had already played with Mellanox OFED and SPDK, and had managed to expose NVMe namespaces to hosts. Then came the hard part: multipath, RAID, high availability, fencing, failover, metadata, and all the invisible machinery that turns “some fast drives on a network” into a storage array people trust.

Everyone loves the architecture diagram until consistency shows up

The seductive part of modern storage is that the physical design looks almost understandable. Two controllers. Shared drives. Fast fabric. Host multipathing. A management layer. Some kind of distributed or mirrored state. A clean diagram makes it all look like Lego for adults with more expensive cables. But that diagram hides the actual problem: storage is not about moving bytes quickly. It’s about moving bytes correctly, while things are failing, lying, racing, timing out, rebooting, and trying to corrupt your weekend.

That’s where the comments got interesting. One person immediately pushed back on the assumption that commercial arrays are just Linux underneath with a management UI bolted on. That may be true sometimes, but not universally. Some vendors lean on BSD because of licensing history. Others use BSD only for bootstrapping while the heavy work bypasses the traditional OS path, more like how ESX handles certain low-level functions. Another commenter with storage industry background said BSD is surprisingly common, and that systems like Hitachi VSP and Dell PowerMax are not simply Linux-driver machines doing normal Linux things.

That correction matters. People often imagine commercial storage as open-source plumbing with a pretty face and a support contract. Sometimes it is. But high-end arrays often carry decades of custom engineering, proprietary microcode, strange hardware assumptions, and specialized data paths that don’t map neatly onto a standard Linux mental model.

The first brutal lesson: active-active is not a weekend project

The most practical reply cut straight to the heart of the DIY dream. If someone wants to build a Linux-based SAN array, the likely building blocks are device-mapper for RAID or volume mapping, some custom metadata to describe disk layout, maybe clustered LVM, and then Pacemaker plus Corosync for HA coordination. For fencing, persistent reservations on the drives become important. That stack might get you an active-passive array. It probably won’t get you a real active-active storage system using only off-the-shelf pieces.
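To make the fencing idea concrete, here is a toy model of reservation-style fencing. It is a sketch under loud assumptions: `ToyDrive`, its methods, and the key names are hypothetical, and the model only imitates the general shape of persistent reservations (register a key, preempt the peer's key, reject writes from unregistered keys). It is not the NVMe or SCSI command set.

```python
class ReservationConflict(Exception):
    """Raised when a controller without a registered key tries to write."""


class ToyDrive:
    """Hypothetical in-memory stand-in for one dual-ported drive."""

    def __init__(self):
        self.registered_keys = set()
        self.blocks = {}

    def register(self, key):
        # Each controller registers its own key before doing I/O.
        self.registered_keys.add(key)

    def preempt(self, my_key, victim_key):
        # Only a registered controller may evict its peer's key.
        if my_key not in self.registered_keys:
            raise ReservationConflict(f"{my_key} is not registered")
        self.registered_keys.discard(victim_key)

    def write(self, key, lba, data):
        # The drive itself refuses writes from a fenced controller.
        if key not in self.registered_keys:
            raise ReservationConflict(f"{key} has been fenced")
        self.blocks[lba] = data


drive = ToyDrive()
drive.register("ctrl-a")
drive.register("ctrl-b")
drive.write("ctrl-a", 0, b"before failover")

# ctrl-b decides ctrl-a is unhealthy and fences it at the drive.
drive.preempt("ctrl-b", "ctrl-a")
try:
    drive.write("ctrl-a", 0, b"stale write")
    stale_write_landed = True
except ReservationConflict:
    stale_write_landed = False
```

The point of the design is that enforcement lives in the drive, not in the goodwill of the fenced controller. A controller that is hung, partitioned, or simply wrong about its own status still cannot land a late write.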

That is a huge distinction.

Active-passive is already hard. One controller owns the resources. The other waits, monitors, and takes over when the first one dies or gets fenced. There is plenty of engineering pain there: failover timing, split-brain prevention, state replication, write ordering, reservation handling, client path behavior, and recovery sequencing. But at least the ownership model is relatively simple.
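The ownership rule in that paragraph can be sketched as a tiny state machine. Everything here is illustrative: `Role`, `Controller`, `take_over`, and the `fence` callback are hypothetical names, and a real stack would delegate this to a cluster manager such as Pacemaker rather than hand-rolled code.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    FENCED = "fenced"


@dataclass
class Controller:
    name: str
    role: Role


def take_over(standby, peer, fence):
    """Promote the standby only after the peer is provably fenced.

    `fence` is a caller-supplied function returning True when the peer
    can no longer write (an assumption; how that is proven is the hard part).
    """
    if standby.role is not Role.STANDBY:
        return False
    if not fence(peer):
        # Cannot prove the peer stopped writing, so stay passive.
        return False
    peer.role = Role.FENCED
    standby.role = Role.ACTIVE
    return True


a = Controller("ctrl-a", Role.ACTIVE)
b = Controller("ctrl-b", Role.STANDBY)

failed_attempt = take_over(b, a, fence=lambda peer: False)
role_after_failed_attempt = b.role
successful_attempt = take_over(b, a, fence=lambda peer: True)
```

The ordering is the whole lesson: fence first, promote second. Reversing those two steps is how two controllers end up writing the same volume.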

Active-active sounds cleaner from a marketing standpoint. Both controllers serve I/O. Both appear useful. Both contribute. Everything feels more modern. But true active-active means both sides must agree constantly about ownership, ordering, cache state, metadata, dirty writes, reservations, rebuilds, and failure handling. If both controllers can write to the same logical structure, concurrency becomes the monster in the room.

One commenter put it plainly: most arrays don’t do true active-active anyway. That line deserves more attention than it usually gets. A lot of enterprise storage marketing uses active-active language loosely. Sometimes it means both controllers are powered and serving different volumes. Sometimes it means optimized paths. Sometimes it means ALUA (Asymmetric Logical Unit Access) behavior, where every path works but only some are preferred. Sometimes it means both controllers can access back-end media, but not necessarily that both are writing the same data structures at the same time in a fully symmetric way.

The words are simple.

The reality is not.

The boring pieces are where the bodies are buried

Storage engineers obsess over boring things because boring things destroy data. Metadata. Superblocks. Checksums. Write barriers. Flush semantics. Reservations. Fencing. Generation counters. Split-brain detection. Stable identifiers. Import rules. Replay logs. Dirty region tracking. These sound like implementation details until one controller dies halfway through an update and the surviving controller has to decide what reality means.
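Here is one hedged sketch of what "deciding what reality means" can look like with generation counters and checksums. The names `MetadataCopy`, `seal`, and `pick_authoritative` are hypothetical, and CRC32 is used only because it ships with Python; a real array would pick a stronger checksum and a far richer on-disk format.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataCopy:
    generation: int   # bumped on every metadata update
    payload: bytes    # the serialized layout description
    checksum: int     # CRC32 of the intended payload, recorded at write time


def seal(generation, payload):
    """Build a copy whose checksum matches its payload."""
    return MetadataCopy(generation, payload, zlib.crc32(payload))


def pick_authoritative(copies):
    """Return the newest copy that still verifies, or None if none do."""
    valid = [c for c in copies if zlib.crc32(c.payload) == c.checksum]
    return max(valid, key=lambda c: c.generation, default=None)


# Generation 6 was written completely on one drive.
old = seal(6, b"layout-v6")
# Generation 7 was torn: the controller died before the payload finished.
torn = MetadataCopy(7, b"layout-v", zlib.crc32(b"layout-v7"))

survivor_view = pick_authoritative([old, torn])
```

The highest generation number alone is not trustworthy, because the newest copy is exactly the one most likely to be half-written. The checksum filters out the torn update, and the generation counter then picks among what is left.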

That’s why one comment warned the original builder to think deeply about the concurrency model and layers of data integrity. Drives and SSDs can lie about writes, misbehave around flushes, or do things that are technically allowed but horrifying if the upper layers assume perfect honesty. The advice was to do a deep dive into ZFS, not necessarily because ZFS is the exact answer for building a commercial NVMe-oF array, but because it teaches the kind of paranoia good storage systems need.

That’s the difference between a demo and an array.

A demo exposes a namespace.

An array survives bad days.

A demo passes a benchmark.

An array knows what to do when a disk returns nonsense, a controller disappears, a host retries aggressively, an interconnect flaps, and a write was acknowledged right before the world split in half.

People underestimate that gap because the happy path looks so clean. SPDK can move I/O fast. Linux has NVMe target support. RDMA can scream. Multipath can make clients see more than one route. Pacemaker can move resources. Device-mapper can build block mappings. All of those are real. None of them automatically produce a trustworthy storage platform.

You still need the opinionated brain tying them together.

Open-source parts exist, but the product is the glue

Several commenters pointed toward real software pieces. Modern Linux includes NVMe target support through nvmet. Corosync is common in HA stacks. RSF-1 appears in commercial HA storage environments. Pacemaker can coordinate failover. Device-mapper can present block devices and underpin volume logic. dmsetup can manipulate those mappings directly, or a builder could create a custom volume manager through APIs, the way LVM does.

That sounds like a list.

It is not a solution.

The actual product is the glue.

How does the system decide which controller owns a namespace? How does it publish that state to hosts? How does it coordinate reservations on dual-ported NVMe drives? How does it handle controller death versus network partition? What happens if the HA link fails but both controllers are still alive? Who fences whom? What if fencing fails? What if both controllers think the other is dead? What metadata exists on disk, who can update it, and how is it versioned? How does rebuild work while I/O continues? Can one side serve stale data? How do you prove it cannot?
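A few of those questions can at least be written down as an explicit decision table. The policy below is one possible answer, not the answer: `Action`, `decide`, and the three inputs (an HA link check, a second heartbeat channel such as a disk or witness, and a tie-break result) are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue as normal"
    FENCE_THEN_TAKE_OVER = "fence peer, then take over"
    STOP_SERVING = "stop serving I/O"


def decide(link_up, peer_heartbeat_fresh, won_tiebreak):
    """Pick an action for one controller from its local observations.

    link_up: the direct HA link to the peer is healthy.
    peer_heartbeat_fresh: the peer is still visible on a second channel.
    won_tiebreak: this controller won an external race (hypothetical).
    """
    if link_up:
        return Action.CONTINUE
    if not peer_heartbeat_fresh:
        # Peer looks dead, but "looks dead" is not proof: fence anyway.
        return Action.FENCE_THEN_TAKE_OVER
    # Link down while the peer is alive: a partition, not a death.
    # Exactly one side may proceed; the loser must stop writing.
    if won_tiebreak:
        return Action.FENCE_THEN_TAKE_OVER
    return Action.STOP_SERVING
```

Note what the table leaves open: what happens when fencing itself fails, and how the tie-break is made safe. Those gaps are where the real engineering lives.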

This is why one commenter said a whole software stack is needed and that it is not plug-and-play. You have to mesh a bunch of things to get the result. That’s understated, but accurate. Storage systems are less like single applications and more like treaties between hostile layers that barely trust each other.

The open-source world gives you powerful ingredients.

Commercial arrays make a meal out of them.

Sometimes with open parts. Sometimes with proprietary parts. Sometimes with custom kernel paths, user-space I/O engines, private metadata formats, and years of failure testing that never makes it into the glossy brochure.

Commercial vendors won’t hand over the secret sauce, and that’s the point

One reply said the quiet part out loud: if the question is what exact software a commercial NVMe-oF array uses under the hood, that’s confidential, and anyone who knows probably won’t share it. That’s not being rude. That’s the business. The parts that make an enterprise array valuable are often the boring, battle-tested internals no vendor wants cloned.

It’s not just UI.

It’s not just “Linux plus drivers.”

It’s behavior.

Failover behavior. Upgrade behavior. Recovery behavior. Dirty cache behavior. Reservation behavior. Management-plane behavior. Telemetry behavior. Supportability behavior. The stuff customers only notice when it fails.

The original engineer’s instinct was still good. Rebuilding a simplified version is one of the best ways to learn. But the lesson from the thread is that modern SAN design is not a shopping list of components. It’s a set of brutally specific decisions about failure. Every architecture has to answer the same questions. What can fail? Who notices? Who decides? Who stops writing? Who keeps serving? Who rebuilds truth afterward?

Bad systems avoid those questions.

Good systems encode answers.

Great systems prove those answers under ugly conditions.

The Linux-versus-BSD debate hides a bigger truth

The argument over whether storage appliances use Linux, BSD, or something more exotic is interesting, but it can become a distraction. The OS matters for licensing, drivers, boot process, tooling, and ecosystem. But in serious storage arrays, the central design question is not “which OS boots?” It is “where does the authoritative data path live?”

Some systems lean heavily on OS storage stacks. Others bypass much of the kernel path for performance, control, or historical reasons. Some expose management through small Unix-like environments while the real I/O path runs in custom modules. Some use virtualization internally for services. One commenter mentioned bhyve being used by vendors to ship management, UI, and data-management functions in small BSD images. Another found it funny because bhyve originated at NetApp, which apparently didn’t end up using it the way others did.

That little side path says a lot. Storage vendors are pragmatic. They reuse what fits. They invent what they must. They carry old decisions forward for decades because rewriting working storage code is how adults discover fear.

A cloud engineer looking at modern NVMe storage may expect something new and clean. In reality, the newest arrays often contain old ideas wearing faster fabrics. Reservations, ownership, fencing, metadata journaling, checksums, cache coherency, failover state machines, and multipath policies are not new problems. NVMe just makes everything faster, including the consequences of getting it wrong.

Multipath is not magic; it’s a contract

Multipath often gets treated like a client-side convenience. Multiple paths from hosts to storage devices. Better resilience. Maybe better balancing. But in a dual-controller array, multipath is part of a larger contract between host, fabric, controller ownership, and failover semantics.

The host needs to know which paths are usable. The array needs to expose paths consistently. During failure, paths need to disappear, degrade, or change state in ways the host understands. If a controller fails over a namespace, the host cannot be left writing to a dead path forever or, worse, writing somewhere unsafe. With NVMe-oF, ANA (Asymmetric Namespace Access) plays a role similar in spirit to ALUA path states in the older SCSI multipath world, but the larger question remains the same: how do clients know where truth lives right now?
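The host side of that contract can be sketched in a few lines. The five state names follow the ANA states defined by the NVMe specification; the selection policy, the `usable_paths` function, and the path names are simplifying assumptions, not how any particular host multipath driver behaves.

```python
from enum import Enum


class AnaState(Enum):
    OPTIMIZED = "optimized"
    NON_OPTIMIZED = "non-optimized"
    INACCESSIBLE = "inaccessible"
    PERSISTENT_LOSS = "persistent-loss"
    CHANGE = "change"


def usable_paths(paths):
    """Return the path names a host should send I/O to right now.

    Prefer optimized paths, fall back to non-optimized ones, and never
    use a path in any other state. An empty list means: queue or fail I/O.
    """
    for wanted in (AnaState.OPTIMIZED, AnaState.NON_OPTIMIZED):
        chosen = sorted(name for name, state in paths.items() if state is wanted)
        if chosen:
            return chosen
    return []


normal = {"ctrl-a": AnaState.OPTIMIZED, "ctrl-b": AnaState.NON_OPTIMIZED}
owner_lost = {"ctrl-a": AnaState.INACCESSIBLE, "ctrl-b": AnaState.NON_OPTIMIZED}
mid_failover = {"ctrl-a": AnaState.CHANGE, "ctrl-b": AnaState.INACCESSIBLE}
```

The interesting case is `mid_failover`: for a moment there is no safe path at all, and the correct host behavior is to wait rather than guess. How long it waits, and what the array promises in that window, is the contract.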

That question ties directly into HA.

HA is not “two boxes connected together.” It is config synchronization, state synchronization, failure-domain clarity, and a fencing model strong enough to stop split brain. One commenter defined it in exactly that direction: HA means config and state sync with clear failure domains. That may sound simple, but it’s the sentence every DIY array eventually crashes into.

Two controllers are easy.

One coherent storage personality is hard.

This is why storage engineers are paranoid for a living

The thread was refreshing because it treated storage as engineering rather than brand warfare. Someone even joked that it was an actually interesting storage post and deserved ten upvotes. That reaction makes sense. The question went straight into the machine room of the industry: what is underneath the shiny arrays?

The answer is part open technology, part proprietary history, part distributed-systems pain, and part very expensive paranoia.

A DIY modern SAN experiment can absolutely teach a lot. Start active-passive. Use nvmet or SPDK for target exposure. Study device-mapper, LVM metadata, persistent reservations, Pacemaker, Corosync, and ZFS. Learn how hosts behave when paths flap. Pull cables. Kill controllers. Corrupt metadata in a lab. Watch what happens when assumptions break. That’s where the real education begins.

But the deeper truth is humbling. Enterprise storage arrays are not expensive only because vendors like margins, though they certainly do. They are expensive because the boring parts are hard, the failure matrix is huge, and customers expect storage to be the one layer that does not blink.

Modern NVMe fabrics changed the speed.

They did not remove the old demons.

They just gave them lower latency.
