You cannot go wrong with the most popular choice: the Prometheus/Grafana stack. That includes node_exporter for anything host-related, and optionally Loki (and one of its agents) for logs. All of this can run anywhere, not just on k8s.
Related, have there been any 'truly open-source' forks of Grafana since their license change? Or does anyone know of good Grafana alternatives from FOSS devs in general? My default right now is to just use Prometheus itself, but I miss some of the dashboard functionality etc. from Grafana.
Grafana's license change to AGPLv3 (I suspect to drive their enterprise sales), combined with an experience I had reporting security vulnerabilities, combined with seeing changes like this[1] not get integrated has left a bad taste in my mouth.
AGPLv3 is a completely valid choice for an open source license, and (not that it was necessarily questioned, but since critique of pushing enterprise sales comes up,) having a split open source/enterprise license structure is not particularly egregious and definitely not new. Some people definitely don't like it, but even Richard Stallman is generally approving of this model[1]. It's hard to find someone more ideologically-oriented towards the success and proliferation of free and open source software, though that obviously doesn't mean everyone agrees.
I'm not saying, FWIW, that I think AGPL is "good", but it is at least a perfectly valid open source license. I'm well aware of the criticisms of it in general. But if you're going to relicense an open source project to "defend" it against abuse, AGPL is probably the most difficult to find any objection to. It literally exists for that reason.
I don't necessarily think that Grafana is the greatest company ever or anything, but I think these gripes are relatively minor in the grand scheme of things. (Well, the security issue might be a bit more serious, but without context I can't judge that one.)
To be fair, AGPLv3 is a very valid open source licence.
Now, poor behaviour from the prom maintainers is a very fertile subject. If you want to see some really spicy threads, check out the one where people raised that Prom's calculation of rate is incorrect, or the thread where people asked for prom to interpolate secrets into its config from env vars, like every other bit of common cloud-adjacent software.
Both times the prom devs behaved pretty poorly and left a really bad taste in my mouth. Victoria Metrics seems like a much better replacement.
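For anyone hitting the env var problem described above, the usual workaround is to render the config from a template before starting the server. A minimal sketch, assuming a `${NAME}` placeholder syntax of my own choosing; `render_config` is a hypothetical helper, not anything Prometheus ships:

```python
import re

# Matches ${NAME} placeholders; the syntax is an assumption for this sketch.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def render_config(template, env):
    """Replace ${NAME} placeholders in template with values from env.

    Raises KeyError listing every placeholder that has no value, so a
    missing secret fails loudly instead of ending up in the config as-is.
    """
    missing = []

    def substitute(match):
        name = match.group(1)
        if name not in env:
            missing.append(name)
            return match.group(0)
        return env[name]

    rendered = _PLACEHOLDER.sub(substitute, template)
    if missing:
        raise KeyError("unset variables: " + ", ".join(sorted(set(missing))))
    return rendered


# Hypothetical config fragment; in real use you would pass os.environ as env.
template = "basic_auth:\n  username: scraper\n  password: ${SCRAPE_PASSWORD}\n"
print(render_config(template, {"SCRAPE_PASSWORD": "s3cret"}))
```

You would run something like this in an entrypoint script and write the result to the file the server actually reads, which keeps secrets out of the checked-in config.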
AGPL prevents wide product adoption, since corporate lawyers caution against relying on AGPL products: it is easy to violate the license terms and get sued as a result.
It's not possible to sell non-FOSS modifications to AGPL-licensed software. I think that's intended. It's not antithetical to Open Source, quite the opposite in fact.
Yeah, but lawyers (and companies where these lawyers work) are afraid of licenses with unclear or vague terms such as GPL, LGPL, AGPL, BSL, etc. They prefer to deal with software licensed under clear and concise open-source licenses such as Apache2, MIT and BSD.
Companies care about open source if it helps them increase their revenue:
- If they use open source code in their commercial products, then they care about the ability to freely use the code without legal consequences.
- If they develop an open source product, then they care about increasing the adoption rate of the product.
In both cases truly open source licences, such as Apache2, BSD and MIT, work best. Copyleft licences with some arbitrary restrictions on code use prevent wide adoption of the licensed project.
There is only a single well-known exception: the Linux kernel with its GPL v2 license. Commercial companies have to figure out how to use the Linux kernel in their commercial products because there is no good alternative.
Maybe I should start insisting on the term "FOSS".
Pushover licenses ("truly open source") enable the exploitation of FOSS developers in the name of easy profit for the people building proprietary software around it, while Copyleft licenses ensure that this does not happen, granting each user the essential freedoms. The restrictions are not arbitrary, they exist precisely to ensure that these freedoms cannot be taken away from anyone. If this hinders widespread adoption by companies, it just means that those companies didn't plan on respecting the essential freedoms.
Freedom is the ability to use the open source code without any restrictions. Copyleft licences restrict that freedom. These licences sound good in theory (let's prevent unpaid use of the code in proprietary products!), but they don't work so well in practice (why bother with the legal headache of copyleft-licensed code if it is easier to use BSD-licensed code?). This prevents wide adoption of copyleft-licensed products.
You're misinterpreting it. Integrating FOSS code into a proprietary product is what restricts the user's freedom. Copyleft licenses prevent this restriction. And yes, indeed, why bother working for freedom if it's easier to not have freedom?
> Integrating FOSS code into a proprietary product is what restricts the user's freedom. Copyleft licenses prevent this restriction.
This is like saying "black is white".
Users are free to use any products, open source and proprietary. They don't care about licenses most of the time; they prefer the product with better usability. Copyleft licenses prevent anyone from creating a proprietary product with better usability on top of an open-source product with mediocre usability. I.e. copyleft licenses restrict users' freedom to use the best product: they force users to deal with the mediocre product.
Take a simpler example. If you have the freedom to imprison me for no reason, you can take away my (literal) freedom. Now you are free, but I am not. Because of this imbalance, the freedom to arbitrarily imprison people is an unreasonable one. Everyone should have as much freedom as possible, and everyone should have the same "amount" of freedom, if you will, so restricting others is out of scope. It's not just about your own freedom, don't be selfish! And besides, what ethically good person would want to lock people up for no reason?
When you are creating proprietary software, you are asking your users to let themselves be oppressed by you. You are asking them to give in to potential surveillance, planned obsolescence, manipulation, extortion and a variety of other injustices. And when you charge a price for your proprietary application, you are asking your users to pay for this mistreatment. What value does "the best product" really have, when you pay for it with your wallet and your freedom?
Your antique monetization scheme does not align with the values of Free Software. Should you restrict your users' freedom, or fix your monetization scheme?
> combined with them not being a good steward for changes like this[1] left a bad taste in my mouth.
What did they do wrong with this PR? It seems they eventually realized the scope was much bigger, requiring changes on both the frontend and backend, and asked potential contributors to reach out if they're interested in contributing that particular feature (saying between the lines that they themselves don't have a use for it, but won't reject a PR).
Seems like they didn't need it themselves, and asked the community to contribute it if someone really wanted it, but no one has stepped up since then.
I'm using VictoriaMetrics instead of Prometheus, am I doing something wrong?
I have zabbix as well as node_exporter and Percona PMM for mysql servers, because sometimes it is hard to configure the prometheus stack for snmp when zabbix covers this case out of the box.
Prometheus itself is pretty simple, fairly robust, but doesn’t necessarily scale for long-term storage as well. Things like VictoriaMetrics, Mimir, and Thanos tend to be a bit more scalable for longer term storage of metrics.
For a few hundred gigs of metrics, I’ve been fine with Prometheus and some ZFS-send backups.
Just to expand upon some experiences with some of the listed software.
Thanos's architecture is quite different from the others you've listed: Thanos queries fan out to remote Prometheus instances for hot data, while a sidecar ships data (typically older than 2 hours) out to s3 storage.
As the routing of the query depends on setting Prometheus external labels, our developer queries would often fan out unnecessarily to multiple prometheus instances. This is because our developers often search for metrics via a service name or some service related label rather than use an external label which describes the location of the workload which is used by Thanos.
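To make the fan-out problem concrete, here is a toy model of the routing behaviour described above. It is a sketch under simplifying assumptions (equality matchers only, made-up instance names and labels), not Thanos's actual code: an instance can only be skipped when the query carries a matcher that conflicts with one of that instance's external labels.

```python
def stores_to_query(stores, matchers):
    """Return the names of stores that external labels cannot rule out.

    stores maps a store name to its external labels; matchers maps label
    names to the exact values the query asks for.
    """
    selected = []
    for name, external_labels in stores.items():
        conflict = any(
            label in external_labels and external_labels[label] != value
            for label, value in matchers.items()
        )
        if not conflict:
            selected.append(name)
    return selected


# Hypothetical setup: three Prometheus instances labelled by location.
stores = {
    "prom-eu": {"region": "eu"},
    "prom-us": {"region": "us"},
    "prom-ap": {"region": "ap"},
}

# Query by service label only: nothing can be ruled out, so every
# instance is queried and the slowest one sets the response time.
print(stores_to_query(stores, {"service": "checkout"}))

# Adding the external label narrows the fan-out to a single instance.
print(stores_to_query(stores, {"service": "checkout", "region": "eu"}))
```

The service label lives inside the series, not in the external labels, so it gives the router nothing to prune on. That is why queries by service name went everywhere.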
Upon identifying this, I migrated to Mimir and we saw immediate drops in query response times for developer queries, which now don't have to wait for the slowest Prometheus instances before displaying the data.
We've also since adopted OpenTelemetry in our workloads and directly ingest otlp into Mimir (which VictoriaMetrics also supports).
I wrote an extensive reply to this but unfortunately the HN servers restarted and lost it.
The TL;DR was that from where I stand, you’re doing nothing wrong.
At a previous client we ran Prometheus for months, then Thanos, and eventually we implemented Victoria Metrics and everyone was happy. It became an order of magnitude cheaper due to using spinning rust for storage, while still getting better performance. It was infinitely and very easily scalable, mostly automatically.
The “non-compliant” bits of the query language turned out to be fixes to the UX and other issues. Lots of new functions and features.
Support was always excellent.
I’m not affiliated with them in any way. Was always just a very happy freeloading user.
I have deployed lots of metrics systems, starting with cacti and moving through graphite, kairosdb (which used Cassandra under the hood), Prometheus, Thanos and now Mimir.
What I've realised is that they're all painful to scale 'really big'. One Prometheus server is easy. And you can scale vertically and go pretty big. But you need to think about redundancy, and you want to avoid ending up accidentally running 50 Prometheus instances, because that becomes a pain for the Grafana people. Unless you use an aggregating proxy like Promxy. But even then you have issues running aggregating functions across all of the instances. You need to think about expiring old data and possibly aggregating it down into a smaller set so you can still look at certain charts over long periods. What's the Prometheus solution here? MOAR INSTANCES. And reads need to be performant or you'll have very angry engineers during the next SEV1, because their dashboards aren't loading. So you throw in an additional caching solution like Trickster (which rocks!) between Grafana and the metrics. Back in the Kairosdb days you had to know a fair bit about running Cassandra clusters, but these days it's all baked into Mimir.
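For the "aggregating it down into a smaller set" part, the basic idea is just bucketing by time. A minimal sketch, assuming samples are plain (unix_timestamp, value) pairs and that a per-bucket average is good enough; real systems typically keep more than one aggregate per bucket (min, max, sum, count and so on), and `downsample` is a made-up name:

```python
from collections import defaultdict


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Each output point is stamped with the start of its bucket.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - ts % bucket_seconds
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Three raw samples collapsed into 5-minute (300 s) buckets.
raw = [(0, 1.0), (60, 3.0), (300, 5.0)]
print(downsample(raw, 300))
```

Going from 15-second scrapes to 5-minute buckets cuts the point count by 20x, which is what keeps year-long charts loadable.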
I'm lucky enough to be working for a smaller company right now, so I don't have to spend a lot of time tending to the monitoring systems. I love that Mimir is basically Prometheus backed by S3, with all of the scalability and redundancy features built in (though you still have to configure them in large deployments). As long as you're small enough to run their monolithic mode you don't have to worry about individually scaling half a dozen separate components. The actual challenge is getting the $CLOUD side of it deployed, and then passing roles and buckets to the nasty helm charts while still making it easy to configure the ~10 things that you actually care about. Oh and the helm charts and underlying configs are still not rock solid yet, so upgrades can be hairy.
Ditto all of that for logging via Loki.
It's very possible that Mimir is no better than Victoria Metrics, but unless it burns me really badly I think I'll stick with it for now.
Well, they claim superior performance (which might be true), but the costs are high and include a small community, low quality APIs, best effort correctness/PromQL compatibility, and FUD marketing, so I decided to go with the de-facto standard without all of the issues above.
No costs if you're hosting everything. It does scale better and has better performance. Used it and have nothing bad to say about it. For the most part a drop-in replacement that just performs better. Didn't run into PromQL compatibility issues with off-the-shelf Grafana dashboards.
I am on mobile, so cannot really link GitHub for examples, but I'd recommend anyone considering using VM over Prometheus to take a cursory look into how similar things are implemented in both projects, and what shortcuts were made in the name of getting "better performance".
Performance-wise, for example, VictoriaMetrics' prometheus-benchmark only covered instant queries without lookback the last time I checked.
Regarding FUD marketing: All Prometheus community channels (mailing lists, StackOverflow, Reddit, GitHub, etc.) are full of VM devs pushing VM, bashing everything from the ecosystem without mentioning any of the tradeoffs. I am also not aware of VictoriaMetrics giving back anything to the Prometheus ecosystem (can you maybe link some examples if I am wrong?), which is very similar to Microsoft's embrace, extend, and extinguish strategy.
As for recent actual examples, here are two submissions of the same post bashing a project in the ecosystem: https://news.ycombinator.com/item?id=40838531, https://news.ycombinator.com/item?id=39391208, but it's really hard to avoid all the rest in the places mentioned above.
> Performance-wise, for example, VictoriaMetrics' prometheus-benchmark only covered instant queries without lookback the last time I checked.
prometheus-benchmark ( https://github.com/VictoriaMetrics/prometheus-benchmark ) tests CPU usage, RAM usage and disk usage for typical alerting queries. It doesn't test the performance of queries used for building graphs in Grafana, because the typical rate of alerting queries is multiple orders of magnitude higher than the typical rate of queries for building graphs, i.e. alerting queries generate most of the load on CPU, RAM and disk IO in a typical production workload.
This submission posts a link to the real-world experience of a long-term user of Grafana Loki. This user points to various issues in the applications he uses. For example:
As you can see, this user shares his extensive experience with Grafana Loki, and continues using it despite the fact that a much better solution exists, one free from all the Loki issues: VictoriaLogs. This user isn't affiliated with VictoriaMetrics in any way.
Yeah, I've been working on deploying such a stack with added txtai indexing so I can just ask my stack questions: set up txtai workflows and be able to slice questions across what you're monitoring.