Prometheus
Thanos
ClickHouse
Observability
Why Did Tesla Move to ClickHouse Instead of Scaling Thanos - And What That Actually Says About Prometheus at Scale
January 7, 2026
9 min read
Every time a big company moves off a well-known stack, the internet reacts the same way.
“Wait… why didn’t they just scale it?”
That’s exactly what happened when people saw Tesla talk about moving metrics workloads to ClickHouse. The claim floating around was that “Prometheus doesn’t scale horizontally,” and the obvious counter-question showed up immediately:
Why not just use Thanos or Cortex?
It’s a fair question. And the answers are more interesting than the headline.
---
## First: Prometheus *Does* Scale — But With Design Choices
Let’s kill one myth right away.
Prometheus absolutely scales. Just not in the same way a distributed SQL database does.
Out of the box, a single Prometheus server is vertically scalable. Add RAM. Add CPU. Tune retention. Control cardinality.
At serious scale, you introduce horizontal patterns:
- Sharding by functional domain
- Federation
- Thanos
- Cortex
- Mimir
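If “sharding by functional domain” sounds abstract, here is a minimal sketch of the idea, loosely modeled on Prometheus’s `hashmod` relabel action. The targets and shard count are invented; a real setup would drive this from service discovery and relabel rules rather than a hard-coded list.

```python
import hashlib

# Hypothetical scrape targets; in practice these come from service discovery.
TARGETS = ["api-01:9100", "api-02:9100", "db-01:9100", "cache-01:9100", "queue-01:9100"]
SHARD_COUNT = 2  # number of Prometheus servers in the sharded pool

def shard_for(target: str, shards: int) -> int:
    """Deterministically assign a target to a shard, in the spirit of `hashmod`."""
    return int(hashlib.md5(target.encode()).hexdigest(), 16) % shards

for shard in range(SHARD_COUNT):
    owned = [t for t in TARGETS if shard_for(t, SHARD_COUNT) == shard]
    print(f"prometheus-shard-{shard} scrapes {owned}")
```

Each Prometheus server keeps only the series for the targets it owns, which is the whole trick: cardinality gets divided across instances instead of concentrated in one.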
As one experienced operator bluntly put it:
> Prometheus can handle 100 million cardinality. Thanos can handle billions.
That’s not hobby scale.
That’s real infrastructure.
So the idea that “Prometheus doesn’t scale horizontally” isn’t quite accurate. It just doesn’t scale the same way a distributed OLAP database does.
---
## The Thanos Argument: Elastic and Object-Storage Native
One of the strongest replies in the discussion came from someone running both Thanos and ClickHouse, choosing each for different signals.
Their setup:
- Thanos for metrics
- ClickHouse for logs, traces, errors
That split is telling.
Metrics workloads and log workloads are fundamentally different beasts.
And they point out something important:
Scaling ClickHouse horizontally is arguably harder in some ways because each shard has local persistent disk that needs care, and changing shard count is painful.
That’s a very real operational burden.
With Thanos?
- Query components can scale horizontally.
- Store nodes scale with object storage.
- Deployments and StatefulSets make scaling straightforward.
- Data is sharded naturally via S3 object storage.
In other words: scaling Thanos is mostly orchestration and object storage.
Scaling ClickHouse is shard math and disk planning.
Different trade-offs.
---
## So Why Would Tesla Choose ClickHouse?
Now we get into the interesting part.
Because it’s probably not about “Prometheus can’t scale.”
It’s about what kind of scale they needed.
ClickHouse is an analytical database.
It’s optimized for:
- Massive columnar datasets
- Complex aggregations
- Ad-hoc queries
- Long retention
- Multi-signal analysis
Prometheus (even with Thanos) is optimized for:
- Time-series metrics
- Operational queries
- Real-time alerting
- High-ingest telemetry
- PromQL semantics
Those goals overlap — but they’re not identical.
If Tesla wanted:
- Cross-analysis of metrics with other telemetry
- Deep historical analytics
- Arbitrary SQL-style aggregations
- Very long retention at massive scale
ClickHouse starts to look attractive.
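To make “analytical” concrete, here is the kind of query that is awkward to express in PromQL but trivial in SQL: a 90-day, per-region p95 latency rollup. The table and column names below are hypothetical, not Tesla’s schema, and the commented-out client call is just one way to run it.

```python
# A sketch of the long-retention aggregation a columnar store is built for.
# The `metrics` table and its columns are invented for illustration.
QUERY = """
SELECT
    toStartOfDay(timestamp) AS day,
    region,
    quantile(0.95)(value)   AS p95_latency
FROM metrics
WHERE metric_name = 'http_request_duration_seconds'
  AND timestamp >= now() - INTERVAL 90 DAY
GROUP BY day, region
ORDER BY day, region
"""

# With the clickhouse-driver package this could be executed as:
# from clickhouse_driver import Client
# rows = Client(host="clickhouse.example.internal").execute(QUERY)
print(QUERY)
```

Ninety days of raw samples is exactly where a Prometheus-style TSDB starts leaning on downsampling and recording rules, while a columnar engine just scans and aggregates.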
---
## But Here’s the Catch: They Had to Rebuild PromQL
One of the most interesting details in the discussion:
Tesla reportedly introduced their own PromQL-to-SQL transpiler (Comet), built in cooperation with the ClickHouse team.
That detail changes the tone.
If you move away from Prometheus storage but still want PromQL semantics, you have two options:
1. Abandon PromQL.
2. Recreate it.
They chose to recreate it.
That suggests something powerful:
PromQL is hard to replace.
Even if you change storage engines, the query model still matters.
If ClickHouse were a drop-in Prometheus replacement, there’d be no need for a transpiler layer.
The fact that one was built tells you this wasn’t trivial.
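To get a feel for why, here is a deliberately tiny toy that translates exactly one PromQL shape into ClickHouse-flavored SQL. This is not Comet and not Tesla’s approach, just an illustration of how much a real transpiler has to reproduce: counter resets, staleness, offsets, regex matchers, subqueries, every PromQL function. The toy handles none of that, and the target table is made up.

```python
import re

# Matches only the simplest possible shape: rate(metric{key="value"}[5m])
PATTERN = re.compile(
    r'rate\((?P<metric>\w+)\{(?P<key>\w+)="(?P<val>[^"]+)"\}\[(?P<mins>\d+)m\]\)'
)

def transpile_rate(promql: str) -> str:
    """Translate one narrow PromQL rate() shape into illustrative SQL.

    A real transpiler must also handle counter resets, staleness, offsets,
    subqueries, regex matchers, and the rest of PromQL. This toy does not.
    """
    m = PATTERN.fullmatch(promql.strip())
    if not m:
        raise ValueError(f"unsupported expression: {promql}")
    window = int(m.group("mins")) * 60
    return (
        "SELECT (max(value) - min(value)) / {w} AS rate\n"
        "FROM metrics\n"
        "WHERE metric_name = '{metric}'\n"
        "  AND labels['{key}'] = '{val}'\n"
        "  AND timestamp >= now() - INTERVAL {w} SECOND"
    ).format(w=window, metric=m.group("metric"), key=m.group("key"), val=m.group("val"))

print(transpile_rate('rate(http_requests_total{job="api"}[5m])'))
```

Multiply this by every function and matcher in PromQL and you can see why building that layer was a project in its own right.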
---
## High Cardinality Isn’t “Solved” Anywhere
Another thread in the discussion hit a core issue:
What about high cardinality? Isn’t that still a problem in Thanos or Cortex?
Correct.
High cardinality is not magically solved by switching backends.
It’s a data modeling problem.
If your metric design includes:
- Unbounded labels
- Per-user dimensions
- Dynamic identifiers
- Ephemeral workloads
You will pay for it somewhere.
Prometheus?
Thanos?
ClickHouse?
The cost just moves around.
Prometheus pays in memory.
Thanos pays in object storage and index size.
ClickHouse pays in shard pressure and query planning.
Cardinality doesn’t disappear. It just changes shape.
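A quick back-of-envelope calculation makes the point. Every number below is invented, but the arithmetic is not: potential series count is roughly the product of label cardinalities, and one unbounded label dwarfs everything else.

```python
# Illustrative label cardinalities for a single metric; all values are made up.
label_cardinalities = {
    "method": 5,        # GET, POST, ...
    "status": 6,        # 2xx, 3xx, ...
    "service": 40,
    "pod": 300,         # churns on every deploy
    "user_id": 50_000,  # the unbounded one
}

series = 1
for label, values in label_cardinalities.items():
    series *= values
    print(f"after '{label}' ({values} values): {series:,} potential series")

# Dropping the unbounded label is the only change that meaningfully helps:
print(f"without 'user_id': {series // label_cardinalities['user_id']:,} potential series")
```

No backend makes that multiplication go away; it only decides where the bill arrives.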
---
## The Operational Culture Angle
There’s another layer here people don’t talk about enough.
Thanos feels like “Prometheus, but bigger.”
ClickHouse feels like “we’re running a distributed analytical database.”
Those are culturally different moves.
Thanos scaling looks like:
- Add store nodes.
- Add query nodes.
- Let object storage handle blocks.
ClickHouse scaling looks like:
- Plan shard counts.
- Balance partitions.
- Manage replication.
- Think about disk locality.
- Tune distributed SQL.
One commenter even said changing shard count in ClickHouse is painful.
That’s not a trivial complaint.
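Here is why, in the abstract. With naive hash-mod placement, most rows land on a different shard the moment the shard count changes, and every one of those rows lives on local persistent disk that has to be physically rebalanced. This is a generic sketch, not ClickHouse’s exact placement logic, but the arithmetic is the same.

```python
import hashlib

def shard(key: str, n: int) -> int:
    # Naive placement: hash the sharding key, take it modulo the shard count.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"series-{i}" for i in range(100_000)]
moved = sum(1 for k in keys if shard(k, 4) != shard(k, 5))
print(f"{moved / len(keys):.0%} of rows change shards going from 4 to 5")
# Roughly 80% of rows move, and each one sits on local persistent disk.
```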
So Tesla’s move likely wasn’t about ease.
It was about capability.
---
## Metrics vs Multi-Signal Data Warehouse
Another possibility?
They wanted one system for more than just metrics.
ClickHouse is frequently used for:
- Logs
- Traces
- Events
- Business analytics
If you centralize telemetry into one analytical engine, you reduce system sprawl.
You trade:
Operational simplicity (Prometheus + Thanos)
For:
Analytical flexibility (ClickHouse + SQL + custom tooling)
That’s not a “better” decision.
It’s a different optimization target.
---
## The Real Question: What Was Their Workload?
One commenter asked bluntly:
What is high to you?
That’s the right question.
If Tesla was operating at:
- Hundreds of millions of active time series
- Massive long-term retention requirements
- Cross-domain analytics workloads
- Out-of-order ingestion windows
- Complex query workloads beyond PromQL’s strengths
Then a columnar analytics database starts to make sense.
But that doesn’t invalidate Thanos.
It just suggests their use case might have extended beyond “metrics monitoring.”
---
## The Weird Choice Comment
One of the most candid replies called the move “a very weird choice indeed.”
That reaction says something.
Because for many Prometheus operators, Thanos is the natural scaling story.
Object storage.
Stateless query layer.
Easy component scaling.
Well-understood PromQL.
Moving to ClickHouse looks like abandoning that ecosystem.
But if you zoom out, it might not have been abandonment.
It might have been convergence.
Metrics becoming part of a broader analytics fabric.
---
## What This Actually Teaches
This debate isn’t about Tesla.
It’s about architectural philosophy.
If you believe metrics are:
- Operational signals
- Time-series-first
- Alert-driven
- PromQL-native
Then Thanos (or Cortex/Mimir) feels like the cleanest scaling story.
If you believe metrics are:
- Just another data source
- To be queried alongside logs and events
- Part of a unified analytical lake
- Best explored via SQL and distributed compute
Then ClickHouse makes sense.
But you pay for that flexibility in complexity.
---
## Prometheus Didn’t “Lose”
The most interesting detail remains the PromQL transpiler.
Even after moving storage, they preserved PromQL semantics.
That tells you Prometheus didn’t fail conceptually.
Its data model and query language were still valuable enough to emulate.
Storage moved.
Semantics stayed.
That’s not defeat.
That’s evolution.
---
## The Real Takeaway
“Why not just scale Thanos?” is the right question.
But the better question is:
What problem were they actually trying to solve?
Scaling metrics horizontally is one problem.
Unifying massive telemetry datasets under a powerful analytical engine is another.
Prometheus plus Thanos can handle astonishing scale: billions of series in some setups.
ClickHouse can handle enormous analytical workloads — but demands a different operational mindset.
The choice isn’t about capability alone.
It’s about what kind of system you want to run.
And once you understand that, the move doesn’t look weird.
It looks intentional.