I wrote an extensive reply to this but unfortunately the HN servers restarted an...

raffraffraff · on Aug 18, 2024

I have deployed lots of metrics systems, starting with Cacti and moving through Graphite, KairosDB (which used Cassandra under the hood), Prometheus, Thanos and now Mimir.

What I've realised is that they're all painful to scale 'really big'. One Prometheus server is easy. And you can scale vertically and go pretty big. But you need to think about redundancy, and you want to avoid ending up accidentally running 50 Prometheus instances, because that becomes a pain for the Grafana people. Unless you use an aggregating proxy like Promxy. But even then you have issues running aggregating functions across all of the instances. You need to think about expiring old data and possibly aggregating it down into a smaller set so you can still look at certain charts over long periods. What's the Prometheus solution here? MOAR INSTANCES. And reads need to be performant or you'll have very angry engineers during the next SEV1, because their dashboards aren't loading. So you throw in an additional caching solution like Trickster (which rocks!) between Grafana and the metrics. Back in the KairosDB days you had to know a fair bit about running Cassandra clusters, but these days it's all baked into Mimir.
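To make "aggregating it down into a smaller set" concrete, here's a rough Python sketch of the idea: bucket raw samples into wider time windows and keep one average per window. This is only an illustration, not how Prometheus, Thanos or Mimir actually do it; `downsample` and the sample data are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds):
    """Collapse (unix_timestamp, value) samples into one average per bucket.

    Hypothetical helper for illustration only.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    # One (bucket_start, average) point per bucket, in time order.
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Four raw samples at 15s resolution, rolled up to 60s resolution.
raw = [(0, 1.0), (15, 3.0), (60, 10.0), (75, 20.0)]
rolled_up = downsample(raw, 60)
```

You'd keep the rolled-up series for long-range charts and expire the raw samples on a shorter retention. Averaging alone loses spikes, so in practice you'd probably want min/max/count per bucket as well.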

I'm lucky enough to be working for a smaller company right now, so I don't have to spend a lot of time tending to the monitoring systems. I love that Mimir is basically Prometheus backed by S3, with all of the scalability and redundancy features built in (though you still have to configure them in large deployments). As long as you're small enough to run their monolithic mode you don't have to worry about individually scaling half a dozen separate components. The actual challenge is getting the $CLOUD side of it deployed, and then passing roles and buckets to the nasty Helm charts while still making it easy to configure the ~10 things that you actually care about. Oh, and the Helm charts and underlying configs are not rock solid yet, so upgrades can be hairy.

Ditto all of that for logging via Loki.

It's very possible that Mimir is no better than VictoriaMetrics, but unless it burns me really badly I think I'll stick with it for now.