On a related note, Cloudflare just introduced "Super Bot Fight Mode" (https://blog.cloudflare.com/super-bot-fight-mode/), which is basically a whitelisting approach that blocks any automated crawling that doesn't originate from "good bots" (they cite Google & PayPal as examples). Everyone else is out of luck and will be tarpitted (i.e. connections get slower and slower until pages won't load at all), presented with CAPTCHAs or outright blocked. In my opinion this will turn the part of the web that Cloudflare controls into a walled garden not unlike Twitter or Facebook: in theory the content is "public", but if you want to interact with it you have to do it on Cloudflare's terms. Quite sad really to see this happen to the web.
On the other hand, I do not want my site to go down thanks to a few bad 'crawlers' that fork() a thousand HTTP requests every second, forcing me to do manual blocking or to pay for a bigger server / scale out my infrastructure. Why should I have to serve them?
Right, then they shouldn't be affected by the rate-limiting, as long as it's reasonable. If it were applied evenly to all clients/crawlers, it'd at least allow the possibility for a respectful, well-designed crawler to compete.
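Something like a per-client token bucket is what I have in mind. A rough sketch in Python (the rate, burst size and the idea of keying on client IP are just assumptions for illustration, not anything Cloudflare or anyone else actually ships):

    import time
    from collections import defaultdict

    # Hypothetical numbers: every client gets the same budget,
    # whether it's Googlebot or FooCrawler.
    RATE = 5.0     # tokens refilled per second
    BURST = 20.0   # maximum bucket size

    _buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

    def allow_request(client_ip):
        """Return True if this client may make a request right now."""
        b = _buckets[client_ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
        b["ts"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False  # the caller would answer with HTTP 429

A well-behaved crawler never notices a limit like this; the fork()-a-thousand-requests kind gets throttled without anyone having to guess who they are.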
The problem is, if you own a website, it takes the same amount of resources to handle the crawl from Google and FooCrawler even if both are behaving, but I'm going to get a lot more ROI out of letting Google crawl, so I'm incentivized to block FooCrawler but not Google. In fact, the ROI from Google is so high I'm incentivized to devote extra resources just for them to crawl faster.
Agree... this entire argument seems to assume I give a rat's ass about 9000 different crawlers that give me literally zero benefit and only waste server resources. Most of those crawlers are for ad-soaked, piss-poor search engines. I'd rather just block them all and only allow the crawlers that know how to behave.
In the early 90s there were various nascent systems that were essentially public database interfaces for searching.
The idea was that instead of a centralized search, people could have fat clients that individually query these APIs and then aggregate the results on the client machine.
Essentially every query would be a what/where or what/who pair. This would focus the results.
I really think we need to reboot those core ideas.
We have a manual version today. There are quite a few large databases that the crawlers don't get.
The one-place-for-everything approach has the same fundamental problems that were pointed out 30 years ago; they've just become obvious to everybody now.
I wonder what happens to RSS feeds in this situation. Programs I run that process RSS feeds will just fetch them over HTTP completely headlessly, so if there are any CAPTCHAs, I'm not going to see them.
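For what it's worth, here's roughly what such a headless fetcher ends up looking like. The 403/503-plus-HTML check is just my guess at how challenge pages tend to present themselves, not anything documented, and the feed URL is a placeholder:

    import urllib.request
    import urllib.error

    FEED_URL = "https://example.com/feed.xml"  # hypothetical feed

    def fetch_feed(url):
        """Fetch a feed over plain HTTP; return the body or None."""
        req = urllib.request.Request(url, headers={"User-Agent": "my-rss-reader/1.0"})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            # Challenge pages typically come back as 403/503 with an HTML body
            # instead of XML; headlessly, all I can do is log it and give up.
            body = e.read(2048)
            if e.code in (403, 503) and b"<html" in body.lower():
                print(f"{url}: got {e.code} with HTML, probably an anti-bot challenge")
            else:
                print(f"{url}: HTTP error {e.code}")
            return None

    if __name__ == "__main__":
        data = fetch_feed(FEED_URL)
        if data is not None:
            print(f"fetched {len(data)} bytes of feed")

There's no browser in the loop, so if the response is a CAPTCHA page the feed simply stops updating silently.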
I've found that Cloudflare isn't great at this. I even found cases where my site was failing to load for Googlebot (a "good" bot they presumably have the IPs for) because they were serving a captcha instead of my CSS.
So your best bet is setting a page rule to allow those URLs.
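If you'd rather script it than click through the dashboard, something along these lines against the Page Rules API should do it. The zone ID, token and URL pattern are placeholders, and the "security_level"/"essentially_off" action is from memory, so double-check the exact field names against Cloudflare's API docs:

    import json
    import urllib.request

    ZONE_ID = "your-zone-id"      # placeholder
    API_TOKEN = "your-api-token"  # placeholder

    rule = {
        "targets": [{
            "target": "url",
            "constraint": {"operator": "matches", "value": "example.com/feed*"},
        }],
        # Relax bot checks on the feed URLs so headless readers aren't challenged.
        "actions": [{"id": "security_level", "value": "essentially_off"}],
        "status": "active",
    }

    req = urllib.request.Request(
        f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/pagerules",
        data=json.dumps(rule).encode(),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())
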
In my experience, those either get detected(?) and let through (RSS can be aggressively cached, after all), or you're out of luck: the website owner set up e.g. WordPress (which automatically includes RSS URLs) but did not configure Cloudflare to let RSS through.
That will be interesting to see with regard to legal implications. If they (in the website operator's name) block a normal user's access to e.g. privacy info pages "by accident", that could be a compliance issue.
I don't think mass blocking is the right approach in general. IPs, even residential ones, are relatively easy and relatively cheap to come by. At some point you're blocking too many normal users. Captchas are a strong weapon, but they too carry a significant cost by annoying users. Cloudflare could theoretically do invisible-invisible captchas by never even running any code on the client, but that would amount to wholesale tracking and would probably not fly in the EU.
It's not Cloudflare who is deciding it. It's the website owners who request things like "Super Bot Fight Mode". I never enable such things on my CF properties. Mostly it's people who manage websites with "valuable" content, e.g. shops with prices, who desperately want to stop scraping by competitors.
I can say this will give a lot of businesses a false sense of security. It is already bypassable.
The web scraping technology that I am aware of has reached its endgame already: unless you are prepared to authenticate every user/visitor to your website with a dollar sign, or lobby Congress to pass a bill outlawing web scraping, you will not be able to stop web scraping in 2021 and beyond.
Due to aggressive NoScript and uBlock use, I, browsing the website as a human, keep getting hit by captchas, and my success rate is falling to a coin flip. If there's a script to automate that, I'm all ears.
100% doable. Like I said, this type of blanket throttling seems to be the new trend, but it's already defeated.
I just no longer see it as possible to 1) put information on the web (private or public), 2) give access outside your organization (to customers or visitors), and 3) expect your website will not be scraped.
Or maybe don’t “hate” folks who are just trying to put some content online and don’t want to deal with botnets taking down their work? You know, like what the internet was intended for.
> don’t want to deal with botnets taking down their work
Botnets and automated crawling are completely different things. This isn't about preventing service degradation (even if it gets presented that way). It's an attempt by content publishers to control who accesses their content and how.
Cloudflare is actively assisting their customers to do things I view as unethical. Worse, only Cloudflare (or someone in a similarly central position) is capable of doing those things in the first place.
The internet was certainly not intended for centralization. I hit Cloudflare captchas and error pages so often it's almost sickening. So many things are behind Cloudflare, things you least expect to be behind Cloudflare.
It's easy enough to bypass most Cloudflare “anti-bot” with an unusual refresh pattern or messing with a cookie. (It's easier to script this than solve the CAPTCHAs.)
Anyone malicious who is determined enough will just pay their way through any captcha. Yet for me, as a legitimate user, these "one more step" pages feel downright humiliating. At this point, if I see one, I either just nope out of it, or look for a saved copy on archive.org.