Prometheus
Counters
Datadog
Metrics
Prometheus Counters Are Not Broken - But They Are Breaking Teams Who Treat Them Like Datadog
January 27, 2026
8 min read
There’s a particular kind of frustration that only shows up after a migration.
You switch from something like Datadog to Prometheus. Dashboards are rebuilt. Alerts are rewritten. Everything feels leaner, more “open,” more under your control.
And then the counters start lying to you.
Or at least it feels that way.
Low-frequency counters miss increments. Short-lived series disappear from totals. Success rates don’t quite add up over 30 days. Alerts don’t fire when they should. Or worse — they fire late.
One team described it bluntly: counters have been the biggest pain point since the switch. Things that “just worked” before now require careful thinking. And even when you think you’ve done it right, you’re still uneasy.
Is Prometheus unreliable?
Or are we asking it to behave like something it was never meant to be?
Let’s dig in.
---
## The Core Complaint: Slow Counters + Dynamic Labels = Anxiety
The real pain shows up in a few predictable places:
- Alerting on a counter increase when the counter doesn’t start at zero.
- Calculating total increments over a time range, especially when short-lived series exist.
- Viewing frequency of increments as a time series without weird artifacts.
- Computing long-term success rates using `sum(rate(success_total[30d])) / sum(rate(overall_total[30d]))` and realizing short-lived series skew results.
There’s also a meta frustration: the raw data is there. If you manually eyeball the graph, you can often see what “should” be counted. But `rate()` or `increase()` seems to understate things.
And that’s not a bug.
It’s intentional.
Prometheus prefers undercounting over overcounting when it detects edge cases like resets or missed scrapes. For SREs, that safety bias can feel backwards. A false negative alert is worse than a false positive in many real-world setups.
So when your monitoring system chooses safety over sensitivity, it can feel like it’s betraying you.
---
## The First Hard Truth: Very Slow Counters Are Awkward
One experienced voice cut straight to it: very slow-moving counters are a difficult issue with Prometheus.
If you scrape every 30 seconds and something increments once every 20 minutes, you’re working with sparse data. Now add:
- Dynamic labels
- Short-lived pods
- Restarts from deployments
- Autoscaling events
Suddenly, you’re not measuring a clean monotonic series anymore. You’re measuring fragments of counters across instances.
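To make that concrete, here is a hedged PromQL sketch (the metric name `cleanup_runs_total` is made up) showing why window width matters for a counter that moves roughly every 20 minutes:

```
# Mostly zero: a 5-minute window usually contains no increment at all.
rate(cleanup_runs_total[5m])

# More honest for a slow counter: widen the window past the increment interval.
increase(cleanup_runs_total[1h])

# Merge fragments from restarted or autoscaled pods before graphing.
sum(increase(cleanup_runs_total[1h]))
```

None of this recovers data Prometheus never scraped; it just stops short windows from reporting misleading zeros.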
Prometheus handles counter resets.
It does not magically reconstruct missing history from pods that disappeared.
That’s not incompetence. That’s physics.
---
## The Cardinality Trap
One of the most consistent recommendations: reduce cardinality for important SLO metrics.
Too many teams add debugging-level labels directly to error counters:
- `user_id`
- `request_id`
- `feature_flag`
- `shard`
- deployment hash
It feels powerful.
It’s also a recipe for sparsity.
Sparse series make `rate()` less reliable because the time window might include series that only existed for a small portion of that window. Your 30-day success rate query now includes dozens of micro-series that blinked in and out of existence.
Metrics are supposed to answer:
“Is there a problem at X time?”
They are not meant to replace your logs.
When you overload counters with forensic-level labeling, you’re stretching Prometheus beyond its design center.
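As a rough illustration of that split, here is a minimal Python sketch (names are invented; it uses the official `prometheus_client` library): the counter keeps only bounded labels, and the forensic identifiers go to the log line.

```
import logging

from prometheus_client import Counter

# SLO counter: only labels with a small, bounded set of values.
CHECKOUT_REQUESTS = Counter(
    "checkout_requests_total",
    "Checkout requests by endpoint and outcome",
    ["endpoint", "outcome"],
)

def record_result(endpoint: str, ok: bool, user_id: str, request_id: str) -> None:
    outcome = "success" if ok else "error"
    CHECKOUT_REQUESTS.labels(endpoint=endpoint, outcome=outcome).inc()
    if not ok:
        # High-cardinality identifiers belong in logs, not in label values.
        logging.error(
            "checkout failed endpoint=%s user_id=%s request_id=%s",
            endpoint, user_id, request_id,
        )
```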
---
## Short-Lived Workers: The Silent Saboteur
Another pointed comment: short-lived metrics are an anti-pattern.
If you’re using queue-dispatched ephemeral workers or FaaS-style compute, your counters may:
1. Start.
2. Increment once.
3. Vanish.
Now your `increase()` calculation has to reconcile series that only existed for a few scrapes. That’s where weird undercounting shows up.
One solution mentioned: accumulator exporters.
Instead of each short-lived worker exposing its own counter, push increments to a central accumulator (StatsD-style), and let Prometheus scrape that stable source.
In modern stacks, you might get the same effect with OpenTelemetry: workers export delta temporality to a single collector, which aggregates and exposes cumulative counters for Prometheus to scrape.
The theme is consistent: stabilize the counter before Prometheus sees it.
Prometheus likes long-lived time series.
It does not love flickering ones.
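For illustration, here is a minimal Python sketch of that accumulator idea. Everything here is assumed (the `/inc` endpoint, ports, metric and label names); real setups would more likely use statsd_exporter or an OpenTelemetry collector. The shape is what matters: one long-lived process owns the counter, and Prometheus scrapes only that process.

```
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

from prometheus_client import Counter, start_http_server

# One long-lived counter owned by the accumulator, not by the workers.
JOBS = Counter(
    "jobs_processed_total",
    "Jobs processed, aggregated across ephemeral workers",
    ["queue", "outcome"],
)

class IncrementHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Workers send e.g. POST /inc?queue=email&outcome=success&value=1
        params = parse_qs(urlparse(self.path).query)
        JOBS.labels(
            queue=params.get("queue", ["unknown"])[0],
            outcome=params.get("outcome", ["unknown"])[0],
        ).inc(float(params.get("value", ["1"])[0]))
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    start_http_server(9100)  # stable /metrics target for Prometheus
    HTTPServer(("", 8080), IncrementHandler).serve_forever()
```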
---
## “Created Timestamp Injection” Isn’t Magic
For counters that don’t start at zero, Prometheus now supports created-timestamp zero injection via OpenMetrics: the exposition declares when each counter was created, and Prometheus can inject a synthetic zero sample at that moment.
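In OpenMetrics exposition, that extra information is just a `_created` sample next to the counter (the metric name and timestamp below are illustrative); on the Prometheus side, the ingestion currently sits behind a feature flag (`created-timestamp-zero-ingestion` in recent releases).

```
# HELP jobs_processed Jobs processed by this worker.
# TYPE jobs_processed counter
jobs_processed_total{queue="email"} 3
jobs_processed_created{queue="email"} 1.7370432e+09
# EOF
```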
That helps with some startup ambiguity.
But it doesn’t eliminate all edge cases.
If a pod restarts and your scrape interval misses part of the lifecycle, you can still see partial data. Prometheus tries to handle resets intelligently, but it will err on the side of underestimating.
That’s by design.
If you were expecting “perfect delta reconstruction,” you’re expecting something Prometheus doesn’t promise.
---
## Recording Rules: Boring, Powerful, Underrated
The Grafana SLO feature takes a layered recording-rules approach, producing queries like:
```
sum(sum_over_time((grafana_slo_success_rate_5m{})[28d:5m]))
  / sum(sum_over_time((grafana_slo_total_rate_5m{})[28d:5m]))
```
Feels complicated.
But it’s not random ceremony.
Pre-aggregating into stable 5-minute deltas via recording rules makes long-range SLO math far more reliable.
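Stripped of Grafana's specifics, the layering is plain Prometheus recording rules (rule and metric names below are invented for illustration):

```
groups:
  - name: slo_request_rates
    rules:
      # Stable, low-cardinality 5-minute rates, evaluated continuously.
      - record: service:request_success:rate5m
        expr: sum by (service) (rate(request_success_total[5m]))
      - record: service:request:rate5m
        expr: sum by (service) (rate(request_total[5m]))

# The long-range SLO then reads from the recorded series, not raw counters:
#   sum(sum_over_time(service:request_success:rate5m[28d]))
#     / sum(sum_over_time(service:request:rate5m[28d]))
```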
One key suggestion: test your recording rules.
Use `promtool test rules`.
Add alerts if recording rules stop evaluating.
If you don’t test them, you’re trusting invisible plumbing. If you do test them, they’re surprisingly solid.
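Here is a hedged sketch of what such a unit test can look like, assuming the rules above live in a file called `slo_rules.yaml`; you would run it with `promtool test rules slo_tests.yaml`.

```
# slo_tests.yaml
rule_files:
  - slo_rules.yaml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 60 requests per minute, i.e. exactly 1 request per second.
      - series: 'request_total{service="checkout"}'
        values: '0+60x10'
    promql_expr_test:
      - expr: service:request:rate5m
        eval_time: 10m
        exp_samples:
          - labels: 'service:request:rate5m{service="checkout"}'
            value: 1
```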
The irony is that many teams trust raw PromQL more than recording rules — even though recording rules, when tested, are often safer.
---
## The Float Problem (Yes, It’s Real)
Prometheus uses floats.
That means counters lose exact integer precision at very high magnitudes: above 2^53, a float64 can no longer represent every integer, so individual +1 increments start to vanish.
For most web workloads, you’ll never hit that.
For high-speed interfaces or massive aggregations, you might.
One clever workaround mentioned: wrapping uint64 counters modulo 2^53 before export.
It’s niche.
But it’s a reminder that Prometheus made trade-offs early on — and those decisions still ripple outward.
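A tiny Python sketch of that workaround, with everything about the surrounding exporter assumed (the trick itself is just modular arithmetic):

```
# 2**53 is the largest power of two up to which float64 represents
# every integer exactly, so fold the raw counter into that range.
FLOAT_SAFE_MODULUS = 2**53

def wrap_for_export(raw_uint64: int) -> int:
    """Fold a raw 64-bit hardware counter into [0, 2^53).

    When the wrap happens, the exported value drops sharply; Prometheus
    treats that drop as a counter reset, so rate() and increase() stay
    usable. Only the absolute value of the counter loses meaning.
    """
    return raw_uint64 % FLOAT_SAFE_MODULUS
```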
---
## Metrics vs. Logs: The Confidence Crisis
There’s a deeper issue underneath the technical complaints.
When teams start saying:
“Maybe we should just use logs for this.”
That’s not about math. That’s about trust.
Metrics feel less credible when they miss edge cases. When a 30-day success rate query behaves differently depending on window size, confidence erodes.
But here’s the uncomfortable truth:
Metrics were never meant to be perfect forensic accounting systems.
They are coarse, aggregated signals.
Logs are precise.
Metrics are scalable.
If you demand log-level exactness from counters, you’ll always feel let down.
---
## The Most Interesting Idea: “Materialized Metrics”
One long-term proposal floated in the discussion: a new pipeline inside Prometheus that converts scrapes back into deltas, then re-materializes them into projected counters after label reduction.
Drop `instance`.
Drop ephemeral dimensions.
Aggregate deltas first.
Rebuild stable counters.
It’s ambitious.
It acknowledges a central pain: many users think in deltas, not cumulative counters. They want event-like semantics with metric-scale performance.
If something like that ships, it could reshape how people think about counters entirely.
---
## So Are Counters “Very Unreliable”?
No.
But they are unforgiving.
Prometheus counters are extremely reliable when:
- Series are long-lived.
- Cardinality is controlled.
- Scrape intervals match event frequency.
- Recording rules are used for long-range math.
- You’re measuring system health, not auditing transactions.
They become uncomfortable when:
- You mix debugging labels into SLO metrics.
- You rely on short-lived workers.
- You expect perfect reconstruction across restarts.
- You stretch `rate()` across 30 days of fragmented series.
The title calling them “very unreliable” was an admitted exaggeration.
But the frustration is real.
What’s happening isn’t that Prometheus is broken.
It’s that Datadog-style mental models don’t transfer cleanly.
Prometheus forces you to think about:
- Scrape intervals.
- Series lifespan.
- Label economics.
- Aggregation strategy.
That cognitive load feels like regression at first.
Until you realize it’s just a different contract.
And like most contracts in distributed systems, the fine print matters.