Might be a little off-topic, but I don't understand why monitoring solutions are...

jotato · on Dec 28, 2020

> Might be a little off-topic, but I don't understand why monitoring solutions are both so expensive and simultaneously tend to offer such a low resolution. HTTPS-calls aren't particularly computationally expensive

This is 100% the driving force behind pricing at my SaaS (lean20.com). Requests are cheap, but from our analysis people want 1/min intervals. What is your use case for needing checks more often than that?

superice · on Dec 28, 2020

Internally it's about accuracy of reporting. If you do high uptimes (say >99.95%) rounding starts to matter.

If you want the less bureaucratic version: downtime for mission-critical apps means customers call. Immediately. If your SaaS to manage a container terminal is down, that means the container terminal is down. That turns very expensive very fast. Knowing the system is down before you receive the first call is vital. Monitoring once a minute and requiring 2 failing calls before reporting means you have a delay of up to 2 minutes. That is the difference between telling the customer 'yes, we've noticed and we're working on it' versus 'what downtime?' on the phone.

EDIT: I looked into your startup, and definitely love that pricing model. That seems like the right strategy, and beats other monitoring solutions easily at scale (>10 hosts seems to be the cross-over point roughly). If you can report downtime within say 20s of it actually occurring via a webhook or Slack integration or so, I'd love to have an invite. E-mail is in my profile.
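
For anyone who wants to play with the delay arithmetic above, here is a minimal Python sketch. It assumes checks fire on a fixed interval, respond instantly (no timeouts), and that an alert goes out only after N consecutive failed checks. The function names are made up for illustration and do not come from any monitoring product.

```python
def detection_delay_range_s(interval_s: float, failures_required: int) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds from outage start to alert.

    Best case: the outage starts just before a check, so the first failure
    is seen almost immediately and the remaining failures take
    (failures_required - 1) more intervals.
    Worst case: the outage starts just after a successful check, so even
    the first failure is a full interval away.
    """
    if interval_s <= 0 or failures_required < 1:
        raise ValueError("interval must be positive and failures_required >= 1")
    best = interval_s * (failures_required - 1)
    worst = interval_s * failures_required
    return best, worst


def max_interval_for_target_s(target_delay_s: float, failures_required: int) -> float:
    """Longest check interval whose worst-case delay still meets the target."""
    if target_delay_s <= 0 or failures_required < 1:
        raise ValueError("target must be positive and failures_required >= 1")
    return target_delay_s / failures_required


if __name__ == "__main__":
    # Once-a-minute checks, alert after 2 consecutive failures.
    print(detection_delay_range_s(60, 2))
    # Interval needed to report within 20 s, still requiring 2 failures.
    print(max_interval_for_target_s(20, 2))
```

Under these assumptions, once-a-minute checks with 2 required failures alert somewhere between 60 and 120 seconds after the outage starts, and reporting within 20 seconds would need a check roughly every 10 seconds.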

messutied · on Dec 28, 2020

I think most monitoring services offer checks at most once per minute (1 HTTP request / minute). Is there really a need for shorter intervals?

lma21 · on Dec 28, 2020

Unless the alert is managed by a downstream automated system, I don't see the point of having an interval that is smaller than 1 minute -- 1 second or 1 minute won't differ much for a human interaction, right? Am I missing something?

superice · on Dec 28, 2020

Yes it does. It takes about 20 seconds to place a call. If a mission critical system goes down, customers will refresh a grand total of 3 times, then call. As I explained in the sibling comment, reporting downtime within a minute is the difference between 'what do you mean the system is down?' and 'yeah, we noticed and are working on it'.

The idea that polling should be done every minute or so is madness. Here are a few thoughts:

- If your app is meaningfully impacted by even 1 extra request per second, you have more serious problems to fix.

- Rounding matters if you do high uptime. If 4 minutes of downtime per month is high (e.g. ~99.99% uptime), rounding to the nearest minute could make a 50% difference in your reported downtime.

- You can speed up the resolution of any downtime by almost a minute without any extra effort in terms of teaching people or having better processes in place. Even if the gains are small, this is the logical place to spend a tiny bit more to gain a minute on every downtime.

- For most things, standards don't need to be that high. But it doesn't hurt to have quicker feedback, and the cost is negligible.

- Some systems are quick enough in recovery that you don't notice the downtime, like when switching over to a hot replica. Your customers will notice any short downtime that occurs. Why? Scale. Your polling client is only one, but you might have thousands or millions of customers. Chances are at least one of them notices a hiccup. You should view it as your responsibility to ALSO be aware of this, even if you choose not to actively improve on this.

- To phrase it bluntly: standards are too damn low in DevOps.
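
To make the rounding and short-outage points in the list above concrete, here is a rough Python sketch. It assumes a 30-day month, a single poller that checks at fixed times (0, T, 2T, ...) with instant responses, and downtime reported as "failed checks × interval". All names are hypothetical and the model is deliberately simplistic.

```python
import math

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # assumption: 30-day month


def downtime_budget_minutes(uptime_pct: float,
                            period_minutes: float = MINUTES_PER_30_DAY_MONTH) -> float:
    """Minutes of downtime allowed in the period at the given uptime percentage."""
    return period_minutes * (100 - uptime_pct) / 100


def reported_downtime_s(outage_start_s: float, outage_len_s: float,
                        interval_s: float) -> float:
    """Downtime a fixed-interval poller would report for one outage.

    Counts the checks at k * interval_s that fall inside
    [outage_start_s, outage_start_s + outage_len_s) and charges one full
    interval per failed check.
    """
    end = outage_start_s + outage_len_s
    k = math.ceil(outage_start_s / interval_s)  # first check at or after the start
    failed = 0
    while k * interval_s < end:
        failed += 1
        k += 1
    return failed * interval_s


def chance_poller_sees_outage(outage_len_s: float, interval_s: float) -> float:
    """Probability one poller hits an outage that starts at a random phase."""
    return min(1.0, outage_len_s / interval_s)


if __name__ == "__main__":
    print(downtime_budget_minutes(99.99))   # monthly budget at 99.99% uptime
    # The same 90 s outage, seen by a 60 s poller at two different phases:
    print(reported_downtime_s(1, 90, 60), reported_downtime_s(31, 90, 60))
    # A 5 s failover blip, polled every 60 s versus every 1 s:
    print(chance_poller_sees_outage(5, 60), chance_poller_sees_outage(5, 1))
```

In this toy model a 90-second outage is reported as either 60 or 120 seconds by a once-a-minute poller depending on when it starts, while a once-a-second poller reports 90. Against a budget of about 4.3 minutes per month at 99.99%, that spread is not small.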