Prometheus
Thanos
ClickHouse
Observability
Why Did Tesla Move to ClickHouse Instead of Scaling Thanos - And What That Actually Says About Prometheus at Scale
January 7, 2026
9 min read
Every time a big company moves off a well-known stack, the internet reacts the same way.
“Wait… why didn’t they just scale it?”
That’s exactly what happened when people saw Tesla talk about moving metrics workloads to ClickHouse. The claim floating around was that “Prometheus doesn’t scale horizontally,” and the obvious counter-question showed up immediately:
Why not just use Thanos or Cortex?
It’s a fair question. And the answers are more interesting than the headline.
---
## First: Prometheus *Does* Scale — But With Design Choices
Let’s kill one myth right away.
Prometheus absolutely scales. Just not in the same way a distributed SQL database does.
Out of the box, a single Prometheus server is vertically scalable. Add RAM. Add CPU. Tune retention. Control cardinality.
At serious scale, you introduce horizontal patterns:
- Sharding by functional domain
- Federation
- Thanos
- Cortex
- Mimir
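If “sharding by functional domain” sounds abstract, here is a minimal sketch of the idea, loosely modeled on Prometheus’s `hashmod` relabel action. The targets and shard count are invented; a real setup would drive this from service discovery and relabel rules rather than a hard-coded list.

```python
import hashlib

# Hypothetical scrape targets; in practice these come from service discovery.
TARGETS = ["api-01:9100", "api-02:9100", "db-01:9100", "cache-01:9100", "queue-01:9100"]
SHARD_COUNT = 2  # number of Prometheus servers in the sharded pool

def shard_for(target: str, shards: int) -> int:
    """Deterministically assign a target to a shard, in the spirit of `hashmod`."""
    return int(hashlib.md5(target.encode()).hexdigest(), 16) % shards

for shard in range(SHARD_COUNT):
    owned = [t for t in TARGETS if shard_for(t, SHARD_COUNT) == shard]
    print(f"prometheus-shard-{shard} scrapes {owned}")
```

Each Prometheus server keeps only the series for the targets it owns, which is the whole trick: cardinality gets divided across instances instead of concentrated in one.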
As one experienced operator bluntly put it:
> Prometheus can handle 100 million cardinality. Thanos can handle billions.
That’s not hobby scale.
That’s real infrastructure.
So the idea that “Prometheus doesn’t scale horizontally” isn’t quite accurate. It just doesn’t scale the same way a distributed OLAP database does.
---
## The Thanos Argument: Elastic and Object-Storage Native
One of the strongest replies in the discussion came from someone running both Thanos and ClickHouse, choosing each for different signals.
Their setup:
- Thanos for metrics
- ClickHouse for logs, traces, errors
That split is telling.
Metrics workloads and log workloads are fundamentally different beasts.
And they point out something important:
Scaling ClickHouse horizontally is arguably harder in some ways because each shard has local persistent disk that needs care, and changing shard count is painful.
That’s a very real operational burden.
With Thanos?
- Query components can scale horizontally.
- Store nodes scale with object storage.
- Deployments and StatefulSets make scaling straightforward.
- Data is sharded naturally via S3 object storage.
In other words: scaling Thanos is mostly orchestration and object storage.
Scaling ClickHouse is shard math and disk planning.
Different trade-offs.
---
## So Why Would Tesla Choose ClickHouse?
Now we get into the interesting part.
Because it’s probably not about “Prometheus can’t scale.”
It’s about what kind of scale they needed.
ClickHouse is an analytical database.
It’s optimized for:
- Massive columnar datasets
- Complex aggregations
- Ad-hoc queries
- Long retention
- Multi-signal analysis
Prometheus (even with Thanos) is optimized for:
- Time-series metrics
- Operational queries
- Real-time alerting
- High-ingest telemetry
- PromQL semantics
Those goals overlap — but they’re not identical.
If Tesla wanted:
- Cross-analysis of metrics with other telemetry
- Deep historical analytics
- Arbitrary SQL-style aggregations
- Very long retention at massive scale
ClickHouse starts to look attractive.
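To make “analytical” concrete, here is the kind of query that is awkward to express in PromQL but trivial in SQL: a 90-day, per-region p95 latency rollup. The table and column names below are hypothetical, not Tesla’s schema, and the commented-out client call is just one way to run it.

```python
# A sketch of the long-retention aggregation a columnar store is built for.
# The `metrics` table and its columns are invented for illustration.
QUERY = """
SELECT
    toStartOfDay(timestamp) AS day,
    region,
    quantile(0.95)(value)   AS p95_latency
FROM metrics
WHERE metric_name = 'http_request_duration_seconds'
  AND timestamp >= now() - INTERVAL 90 DAY
GROUP BY day, region
ORDER BY day, region
"""

# With the clickhouse-driver package this could be executed as:
# from clickhouse_driver import Client
# rows = Client(host="clickhouse.example.internal").execute(QUERY)
print(QUERY)
```

Ninety days of raw samples is exactly where a Prometheus-style TSDB starts leaning on downsampling and recording rules, while a columnar engine just scans and aggregates.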
---
## But Here’s the Catch: They Had to Rebuild PromQL
One of the most interesting details in the discussion:
Tesla reportedly introduced their own PromQL-to-SQL transpiler (Comet), built in cooperation with the ClickHouse team.
That detail changes the tone.
If you move away from Prometheus storage but still want PromQL semantics, you have two options:
1. Abandon PromQL.
2. Recreate it.
They chose to recreate it.
That suggests something powerful:
PromQL is hard to replace.
Even if you change storage engines, the query model still matters.
If ClickHouse were a drop-in Prometheus replacement, there’d be no need for a transpiler layer.
The fact that one was built tells you this wasn’t trivial.
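To get a feel for why, here is a deliberately tiny toy that translates exactly one PromQL shape into ClickHouse-flavored SQL. This is not Comet and not Tesla’s approach, just an illustration of how much a real transpiler has to reproduce: counter resets, staleness, offsets, regex matchers, subqueries, every PromQL function. The toy handles none of that, and the target table is made up.

```python
import re

# Matches only the simplest possible shape: rate(metric{key="value"}[5m])
PATTERN = re.compile(
    r'rate\((?P<metric>\w+)\{(?P<key>\w+)="(?P<val>[^"]+)"\}\[(?P<mins>\d+)m\]\)'
)

def transpile_rate(promql: str) -> str:
    """Translate one narrow PromQL rate() shape into illustrative SQL.

    A real transpiler must also handle counter resets, staleness, offsets,
    subqueries, regex matchers, and the rest of PromQL. This toy does not.
    """
    m = PATTERN.fullmatch(promql.strip())
    if not m:
        raise ValueError(f"unsupported expression: {promql}")
    window = int(m.group("mins")) * 60
    return (
        "SELECT (max(value) - min(value)) / {w} AS rate\n"
        "FROM metrics\n"
        "WHERE metric_name = '{metric}'\n"
        "  AND labels['{key}'] = '{val}'\n"
        "  AND timestamp >= now() - INTERVAL {w} SECOND"
    ).format(w=window, metric=m.group("metric"), key=m.group("key"), val=m.group("val"))

print(transpile_rate('rate(http_requests_total{job="api"}[5m])'))
```

Multiply this by every function and matcher in PromQL and you can see why building that layer was a project in its own right.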
---
## High Cardinality Isn’t “Solved” Anywhere
Another thread in the discussion hit a core issue:
What about high cardinality? Isn’t that still a problem in Thanos or Cortex?
Correct.
High cardinality is not magically solved by switching backends.
It’s a data modeling problem.
If your metric design includes:
- Unbounded labels
- Per-user dimensions
- Dynamic identifiers
- Ephemeral workloads
You will pay for it somewhere.
Prometheus?
Thanos?
ClickHouse?
The cost just moves around.
Prometheus pays in memory.
Thanos pays in object storage and index size.
ClickHouse pays in shard pressure and query planning.
Cardinality doesn’t disappear. It just changes shape.
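A quick back-of-envelope calculation makes the point. Every number below is invented, but the arithmetic is not: potential series count is roughly the product of label cardinalities, and one unbounded label dwarfs everything else.

```python
# Illustrative label cardinalities for a single metric; all values are made up.
label_cardinalities = {
    "method": 5,        # GET, POST, ...
    "status": 6,        # 2xx, 3xx, ...
    "service": 40,
    "pod": 300,         # churns on every deploy
    "user_id": 50_000,  # the unbounded one
}

series = 1
for label, values in label_cardinalities.items():
    series *= values
    print(f"after '{label}' ({values} values): {series:,} potential series")

# Dropping the unbounded label is the only change that meaningfully helps:
print(f"without 'user_id': {series // label_cardinalities['user_id']:,} potential series")
```

No backend makes that multiplication go away; it only decides where the bill arrives.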
---
## The Operational Culture Angle
There’s another layer here people don’t talk about enough.
Thanos feels like “Prometheus, but bigger.”
ClickHouse feels like “we’re running a distributed analytical database.”
Those are culturally different moves.
Thanos scaling looks like:
- Add store nodes.
- Add query nodes.
- Let object storage handle blocks.
ClickHouse scaling looks like:
- Plan shard counts.
- Balance partitions.
- Manage replication.
- Think about disk locality.
- Tune distributed SQL.
One commenter even said changing shard count in ClickHouse is painful.
That’s not a trivial complaint.
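Here is why, in the abstract. With naive hash-mod placement, most rows land on a different shard the moment the shard count changes, and every one of those rows lives on local persistent disk that has to be physically rebalanced. This is a generic sketch, not ClickHouse’s exact placement logic, but the arithmetic is the same.

```python
import hashlib

def shard(key: str, n: int) -> int:
    # Naive placement: hash the sharding key, take it modulo the shard count.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"series-{i}" for i in range(100_000)]
moved = sum(1 for k in keys if shard(k, 4) != shard(k, 5))
print(f"{moved / len(keys):.0%} of rows change shards going from 4 to 5")
# Roughly 80% of rows move, and each one sits on local persistent disk.
```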
So Tesla’s move likely wasn’t about ease.
It was about capability.
---
## Metrics vs Multi-Signal Data Warehouse
Another possibility?
They wanted one system for more than just metrics.
ClickHouse is frequently used for:
- Logs
- Traces
- Events
- Business analytics
If you centralize telemetry into one analytical engine, you reduce system sprawl.
You trade:
Operational simplicity (Prometheus + Thanos)
For:
Analytical flexibility (ClickHouse + SQL + custom tooling)
That’s not a “better” decision.
It’s a different optimization target.
---
## The Real Question: What Was Their Workload?
One commenter asked bluntly:
What is high to you?
That’s the right question.
If Tesla was operating at:
- Hundreds of millions of active time series
- Massive long-term retention requirements
- Cross-domain analytics workloads
- Out-of-order ingestion windows
- Complex query workloads beyond PromQL’s strengths
Then a columnar analytics database starts to make sense.
But that doesn’t invalidate Thanos.
It just suggests their use case might have extended beyond “metrics monitoring.”
---
## The Weird Choice Comment
One of the most candid replies called the move “a very weird choice indeed.”
That reaction says something.
Because for many Prometheus operators, Thanos is the natural scaling story.
Object storage.
Stateless query layer.
Easy component scaling.
Well-understood PromQL.
Moving to ClickHouse looks like abandoning that ecosystem.
But if you zoom out, it might not have been abandonment.
It might have been convergence.
Metrics becoming part of a broader analytics fabric.
---
## What This Actually Teaches
This debate isn’t about Tesla.
It’s about architectural philosophy.
If you believe metrics are:
- Operational signals
- Time-series-first
- Alert-driven
- PromQL-native
Then Thanos (or Cortex/Mimir) feels like the cleanest scaling story.
If you believe metrics are:
- Just another data source
- To be queried alongside logs and events
- Part of a unified analytical lake
- Best explored via SQL and distributed compute
Then ClickHouse makes sense.
But you pay for that flexibility in complexity.
---
## Prometheus Didn’t “Lose”
The most interesting detail remains the PromQL transpiler.
Even after moving storage, they preserved PromQL semantics.
That tells you Prometheus didn’t fail conceptually.
Its data model and query language were still valuable enough to emulate.
Storage moved.
Semantics stayed.
That’s not defeat.
That’s evolution.
---
## The Real Takeaway
“Why not just scale Thanos?” is the right question.
But the better question is:
What problem were they actually trying to solve?
Scaling metrics horizontally is one problem.
Unifying massive telemetry datasets under a powerful analytical engine is another.
Prometheus plus Thanos can handle astonishing scale: billions of series in some setups.
ClickHouse can handle enormous analytical workloads — but demands a different operational mindset.
The choice isn’t about capability alone.
It’s about what kind of system you want to run.
And once you understand that, the move doesn’t look weird.
It looks intentional.