# The 120PB Storage Question That Makes Every “Just Use Open Source” Answer Feel Suddenly Expensive

January 14, 2026
15 min read
There are storage questions, and then there are storage questions that make everyone in the room sit up straighter. One hundred and twenty petabytes for HPC and AI in support of quantitative research is the second kind.

This is not a “which NAS should I buy?” problem. This is not a homelab fantasy with too many hard drives and not enough USB ports, though people made those jokes because storage people cope with terror through sarcasm. This is a program. A physical footprint. A support contract. A procurement war. A three-year bet with roughly £30 million behind it. And if it goes wrong, it won’t be a small migration. It’ll be a career-defining incident with racks.

The core question sounds deceptively simple: go with DDN because an NVIDIA contact recommended it as cost-effective at this scale, or try to save money with open-source Lustre and maybe DeepSeek’s 3FS, which colleagues in Hong Kong and Germany had recommended for AI workloads.

That’s a reasonable question on paper. At 120PB, though, “reasonable” gets eaten alive by operations. The real choice is not product versus product. It is whether the organization wants to buy a supported outcome, build a storage engineering team, or accidentally pretend those are the same thing.

## At 120PB, the storage platform is no longer the product: the organization is

The most important number in the whole discussion is not 120PB. It is three years.

That timeframe changes the texture of the decision. Anyone can draw a grand architecture on a whiteboard. Fewer teams can keep that architecture healthy through firmware, failures, capacity expansion, performance tuning, metadata pressure, user behavior, network changes, security reviews, audits, vendor escalations, and the slow grind of researchers who only care that their jobs finish faster.

At this scale, storage stops being a box. It becomes an operating model.
Open-source Lustre can absolutely be powerful. It exists in serious HPC environments for a reason. But “save a bit with OS Lustre” is one of those phrases that needs a warning label. Saving money on licensing or vendor packaging can shift cost into people, process, testing, and risk. That can be a great trade if the organization has the team for it. It can be disastrous if leadership thinks “open source” means “cheap enterprise storage without enterprise staffing.”

One commenter distilled the cynical version of scale-out file systems into a joke: it’s just taping together NAS servers. That’s unfair to real distributed storage engineering, but the joke lands because bad scale-out designs often feel exactly like that. A bunch of nodes, a brave diagram, and a prayer that the metadata layer doesn’t become a haunted house.

The uncomfortable question is not whether Lustre works. It does. The question is whether this specific organization is ready to own Lustre at 120PB with AI and quantitative research users breathing down its neck.

## DDN is the obvious HPC answer, which makes the horror stories hit harder

DDN’s name naturally appears in this kind of conversation. HPC storage, AI training pipelines, large scientific workloads, big parallel file systems: this is the world where DDN has long been part of the vocabulary.

An NVIDIA contact recommending DDN is not surprising. At that scale, a packaged, supported platform that understands GPU-fed workloads and can keep the storage side from becoming the bottleneck has a clean argument: spend enough on storage to make the compute useful, but not so much that the budget gets robbed from processing and networking.

That’s the optimistic case. Then the comments kicked the door open. One person described DDN as horribly unreliable and told a brutal story about a professional services person being onsite for days after an outage, only to be laid off mid-repair.
Another said every interaction they’d had with DDN was catastrophic, including business-critical SLA breaches.

That kind of anecdote is not a benchmark. It is not a lab result. But storage buyers ignore stories like that at their peril because support experience is part of the product. At 120PB, nobody buys pure technology. They buy escalation paths. They buy spare-part logistics. They buy field expertise. They buy the confidence that when something weird happens on a Friday night, the vendor does not become another system that needs debugging.

The fair version is that every major storage vendor has horror stories. NetApp has them. Dell has them. IBM has them. Pure has them. VAST has them. Open-source deployments definitely have them. But DDN’s risk in this thread was not technical skepticism alone. It was trust damage. If your shortlist starts with “NVIDIA recommended them,” and the first replies include “we ripped it out after contract expiry,” the evaluation has to get much sharper.

## NetApp walked into the debate wearing a crown and a question mark

A fascinating side argument formed around NetApp. Some people pushed it as the safer enterprise choice. One commenter said “NetApp is the way” after agreeing with DDN criticism. Another defended NetApp hard, arguing that ONTAP has decades of development behind it and that premium enterprise storage can be worth paying for when security, governance, and research data matter. They framed the alternative as buying cheaper systems that require constant tinkering and bring “patch Tuesdays” into the storage world.

That’s the classic NetApp argument: maturity matters. Governance matters. Security matters. Operational polish matters. Not everything should be a race to the bottom when the data is valuable.

But the counterargument was sharp too. Someone questioned whether NetApp really fits 120PB HPC/AI scale, especially around NFS 4.2, pNFS, FlexFiles, and metadata services.
Another flatly said, “Not with 120PB.” Others debated whether NetApp had true scale-out NAS, whether AFX changed the picture, and whether a new product can be called mature just because the company behind it is mature.

That last point is important. Vendor maturity and product maturity are not the same thing. A company can have 30 years of engineering history and still ship a new architecture that deserves careful proof. A product can inherit ideas from mature systems and still behave differently at scale. Buyers love brand comfort, but physics and metadata do not care about brand comfort.

NetApp may be a serious option for enterprise data management. It may even be the right option for certain parts of a 120PB estate. But for HPC/AI scratch, training data, and parallel access at this scale, “NetApp is enterprise” is not enough. The proof has to be workload-specific.

## The startup problem is really a trust problem

The thread also carried a strong anti-startup tone. VAST and Hammerspace were treated with suspicion by some commenters. One person said they would trust NetApp over a startup running on white-box systems any day. Another predicted Hammerspace would be gone or acquired within five years. These comments may sound dismissive, but they point to a real procurement fear: what happens if the vendor story changes before the data lifecycle ends?

At 120PB, vendor survivability matters. Migration is not easy. Exit costs are brutal. Data gravity is not a metaphor anymore; it is a physical and economic force. If a platform becomes strategically wrong, moving away from it can require years, temporary duplicate capacity, network planning, downtime windows, application changes, and a level of project discipline that makes everyone miserable.

That said, dismissing newer platforms simply because they are newer can also be lazy. Some startup-era architectures exist because older systems were not built for AI-era access patterns.
VAST, for example, gets attention in big-data and AI conversations because it tries to collapse some traditional storage tradeoffs around performance, capacity, and namespace design. Hammerspace gets attention because global data orchestration and metadata-driven access are real problems. The question is not whether newer vendors are automatically reckless. The question is whether they can prove support, roadmap, economics, and failure behavior at the required scale.

For a 120PB research environment, the safest answer may not be the oldest vendor or the newest architecture. It may be the one that can survive the organization’s actual workload and still answer the phone intelligently when something breaks.

## DeepSeek’s 3FS is exciting, but excitement is not a storage strategy

The mention of DeepSeek’s 3FS gives the whole debate a newer AI flavor. AI teams love fresh infrastructure ideas because model training and data pipelines expose pain fast. If colleagues in Hong Kong and Germany are recommending 3FS, that is worth investigating. But at 120PB, “colleagues recommended it” should start a lab evaluation, not end a procurement process.

Emerging file systems can look incredible in the environment they were built for. Then they meet a different organization’s security model, user base, scheduler, network, compliance process, support expectations, upgrade cadence, and failure modes. Suddenly the elegant architecture needs documentation, tooling, backup integration, observability, quotas, lifecycle controls, recovery procedures, and people who understand it deeply enough to fix it when the original authors are asleep or unreachable.

This is not an argument against 3FS. It is an argument against treating production research data like a science project unless the organization explicitly wants to become part of that science project.
Open-source and emerging systems work best when the team has enough internal engineering strength to be a real participant, not just a consumer. That means reading code, understanding failure modes, contributing fixes, building automation, designing monitoring, and accepting that support may not look like a traditional vendor escalation.

If the organization wants that level of control, great. If it wants an appliance-like experience at 120PB, reality will be less kind. The cheapest software can become the most expensive system if the team has to learn it during an outage.

## The jokes were absurd because the scale is absurd

The comment thread also did what technical forums always do when a number gets ridiculous: it turned into comedy. Someone asked whether 120PB would be good for an adult collection. Someone joked about fitting many Linux ISOs. Another imagined thousands of USB-to-SATA adapters and the challenge of finding 4,000 USB ports. It’s silly, but it serves a purpose. Jokes are how people mentally process a storage request big enough to feel unreal.

Underneath the jokes is a useful truth: 120PB is not just “a lot of disks.” It is power, cooling, floor space, networking, rebuild time, failure rates, spares, firmware domains, data protection strategy, namespace design, client behavior, monitoring, and budget politics. At this size, even tiny percentages become huge. One percent of 120PB is 1.2PB. A migration mistake is not a folder problem. A performance bottleneck is not one angry user. A bad procurement assumption can waste millions.

That is why consumer-style thinking collapses immediately. You cannot homelab your way into this class of system. You cannot “just add drives” without understanding rebuild math, rack density, power, network oversubscription, and operational staffing. You cannot choose based on a single vendor recommendation, a single horror story, or a single favorite open-source project. The scale itself is the enemy.
Every design choice becomes heavier.

## The real battle is support versus control

The most useful way to frame the decision is not DDN versus Lustre versus 3FS versus NetApp versus VAST. It is support versus control.

A vendor platform gives you a throat to choke, a tested configuration, support contracts, reference architectures, and usually some level of lifecycle discipline. It also gives you vendor lock-in, pricing pressure, roadmap dependency, and the possibility that support quality disappoints exactly when you need it most.

An open-source or self-built approach gives you control, transparency, flexibility, and potentially better economics at hardware scale. It also makes your own team the escalation path. You own the integration. You own the weird bugs. You own the performance tuning. You own the consequences of under-staffing.

Neither model is automatically smarter. But confusing them is deadly. If the organization has elite HPC storage engineers, strong Linux and parallel filesystem expertise, proper test environments, and the appetite to operate like a storage vendor internally, open-source Lustre or even emerging options may be viable. If the organization wants to focus on quantitative research and AI rather than becoming a storage product company, buying a supported platform starts making more sense, even at a premium.

The expensive solution may be cheaper if it keeps researchers productive. The cheap solution may be better if the team can actually own it. The disaster is buying cheap while staffing like expensive support exists.

## The proof of concept needs to be cruel

At 120PB, a normal proof of concept is not enough. Vendors are good at demos. File systems are good at happy paths. Research workloads are not happy paths.
They are metadata storms, huge sequential reads, random access patterns, checkpoint bursts, small-file misery, model training pipelines, scratch cleanup disasters, and users who will absolutely find the weirdest possible way to abuse the namespace.

The evaluation needs to be mean. Test the real workload. Test GPU starvation. Test metadata-heavy directories. Test checkpoint storms. Test mixed read/write pressure. Test client failures. Test rack loss. Test network congestion. Test rebuilds while users keep running jobs. Test software upgrades. Test quota behavior. Test snapshots or data protection if required. Test audit and governance. Test restore. Test what happens when the vendor’s first-line support gives the wrong answer.

Ask every vendor for three references at similar scale and similar workload, not just logos. Ask about failed deployments. Ask what customers hate. Ask what happens when capacity doubles. Ask how pricing changes when the system needs more metadata performance instead of raw capacity. Ask what data mobility looks like if the platform has to be replaced. Ask how long a full migration would really take.

A serious vendor should survive hard questions. A weak vendor will retreat into architecture diagrams.

## The budget is large, but not infinite

Thirty million pounds over three years sounds huge until it meets 120PB of HPC/AI storage, networking, support, spares, power, facilities, and staffing. The storage purchase is only one slice. If the NVIDIA contact framed DDN as cost-effective because it frees more budget for processing and networks, that is a valid systems-level concern.

AI infrastructure is a balance. Overspend on storage and the GPUs wait in a smaller cluster. Underspend on storage and the GPUs starve. Either way, expensive silicon sits around judging everyone.

This is why cost per petabyte alone is too crude. The real metric is cost per useful research outcome.
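A rough back-of-envelope sketch shows how quickly the headline numbers shrink. The £30 million, 120PB, and three-year figures come from the question itself; the share of budget that goes to storage and the sustained migration throughput are purely illustrative assumptions, not vendor quotes or measurements.

```python
# Back-of-envelope arithmetic for a 120PB program.
# Budget and capacity come from the article; every other input is a
# hypothetical placeholder to be replaced with real quotes and measurements.

TOTAL_BUDGET_GBP = 30_000_000  # roughly £30 million over three years
CAPACITY_PB = 120              # decimal petabytes (1 PB = 1e15 bytes)


def cost_per_pb(storage_share: float) -> float:
    """GBP available per petabyte if `storage_share` of the budget buys storage.

    storage_share is an assumption: the rest of the budget covers networking,
    support, spares, power, facilities, and staffing.
    """
    if not 0 < storage_share <= 1:
        raise ValueError("storage_share must be in (0, 1]")
    return TOTAL_BUDGET_GBP * storage_share / CAPACITY_PB


def migration_days(sustained_gb_per_s: float, capacity_pb: float = CAPACITY_PB) -> float:
    """Days needed to copy `capacity_pb` once at a sustained rate in GB/s.

    Ignores verification passes, retries, throttling for production jobs,
    and small-file overhead, so real migrations take longer.
    """
    if sustained_gb_per_s <= 0:
        raise ValueError("sustained_gb_per_s must be positive")
    total_bytes = capacity_pb * 1e15
    seconds = total_bytes / (sustained_gb_per_s * 1e9)
    return seconds / 86_400


if __name__ == "__main__":
    for share in (1.0, 0.6, 0.4):  # hypothetical budget splits
        print(f"storage share {share:.0%}: £{cost_per_pb(share):,.0f} per PB")
    for rate in (10, 50, 100):     # hypothetical sustained GB/s
        print(f"{rate} GB/s sustained: {migration_days(rate):.1f} days for a full copy")
```

Even if every pound went to storage, the ceiling is £250,000 per petabyte; at an assumed 40% share it falls to £100,000. A full copy at an assumed 100 GB/s sustained takes roughly 14 days of uninterrupted transfer, and at 10 GB/s about 139 days. None of this picks a platform, but it turns “cost-effective” and “how long would migration take” into numbers a vendor can be asked to defend.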
If a more expensive storage platform keeps accelerators fed, reduces outages, simplifies operations, and avoids hiring a small army of specialists, it may be cheaper in practice. If a cheaper open-source platform performs well and the team can operate it confidently, vendor premium may be wasted. The only wrong move is optimizing purchase price while ignoring operational cost.

At 120PB, every inefficiency becomes a line item. Every outage becomes expensive. Every migration becomes political. Every under-designed network link becomes a bottleneck with a name and a meeting attached. The budget needs to buy capacity, yes. But more than that, it needs to buy confidence.

## There may not be one platform to rule it all

The thread naturally treats “best storage” as a single answer, but the real architecture may be layered. HPC/AI environments often have different storage personalities: hot training data, scratch space, checkpoint storage, long-term research datasets, governance-controlled enterprise data, archive, backup, and replication. One product may not be ideal for all of them.

DDN or Lustre-like systems may make sense for performance-heavy parallel workloads. NetApp may make sense for governed enterprise NAS or research data with heavy security and policy needs. Object storage may fit some pipelines. Tape or cold archive may still matter for long-term retention. VAST may fit certain high-performance unstructured use cases. 3FS may deserve a lab if AI-specific workloads align with its strengths.

Trying to force every workload into one platform can simplify procurement while complicating life. Splitting platforms can improve fit while creating data movement and management pain. There is no free lunch, just different menus of regret.

The key is data classification. What must be fast? What must be governed? What must be cheap? What must be retained? What must be shared globally? What can be regenerated? What is irreplaceable?
What access patterns exist today, and which ones are likely to appear once users discover the system is faster? Storage architecture should follow data behavior, not vendor slogans.

## The sane answer is uncomfortable: buy expertise before buying petabytes

The most responsible recommendation for a 120PB HPC/AI project is not “choose DDN” or “choose NetApp” or “build Lustre.” It is to make sure the organization has independent expertise before committing.

Hire or contract people who have operated storage at this scale. Not people who have seen big numbers in slides. People who have dealt with failed OSTs, metadata bottlenecks, client storms, vendor escalations, bad firmware, and expensive systems behaving badly under real users.

The organization should run a structured bake-off, but also a staffing bake-off. Who will operate this? Who will tune it? Who will patch it? Who will own user complaints? Who will understand when performance problems are storage, network, scheduler, GPU, application, or filesystem behavior? Who will manage capacity planning? Who will defend the architecture to finance? Who will design the exit path? If those answers are vague, the technology choice is premature.

A £30 million storage program cannot be steered by vendor recommendations alone, even from NVIDIA. It also cannot be steered by forum horror stories alone, even when they sound terrifying. It needs workload evidence, reference customers, operational design, financial modeling, and a clear decision about whether the team wants a product or a platform they effectively co-own.

## The final decision is really about what kind of pain feels survivable

DDN may still be the right answer if the workload is classic HPC/AI, the references are strong, the support contract has teeth, and the proof of concept shows real performance under ugly conditions. The horror stories mean the support model needs intense scrutiny, not automatic rejection.
Open-source Lustre may be the right answer if the organization has the talent and appetite to run it seriously. The savings are real only if the operational model is real.

DeepSeek’s 3FS may be worth testing, especially for AI-specific workflows, but it should earn production trust slowly.

NetApp may be excellent for governed enterprise data and some scale-out needs, but any claim around 120PB HPC/AI performance has to be proven brutally, not assumed from brand maturity.

VAST and other newer architectures may deserve a look, but vendor survivability, support, and exit cost need just as much attention as benchmark numbers.

The hard truth is that 120PB doesn’t care about anyone’s favorite vendor. It doesn’t care about open-source ideology. It doesn’t care about brand loyalty, startup energy, or who had a bad support experience in 2021. It cares about failure domains, metadata, throughput, latency, rebuilds, people, power, cooling, networking, and whether the platform can still make sense when everyone is tired.

At this scale, the question isn’t “what storage should we buy?” It’s “what storage failure mode are we willing to live with?” That’s the choice hidden inside every petabyte.