On a related note, Cloudflare just introduced "Super Bot Fight Mode" (https://blog.cloudflare.com/super-bot-fight-mode/), which is basically a whitelisting approach that blocks any automated crawling that doesn't originate from "good bots" (they cite Google & PayPal as examples). Everyone else is out of luck and will be tarpitted (i.e. connections get slower and slower until pages won't load at all), presented with CAPTCHAs or outright blocked. In my opinion this will turn the part of the web that Cloudflare controls into a walled garden not unlike Twitter or Facebook: in theory the content is "public", but if you want to interact with it you have to do it on Cloudflare's terms. Quite sad really to see this happen to the web.
On the other hand, I do not want my site to go down thanks to a few bad 'crawlers' that fork() a thousand HTTP requests every second, forcing me to do manual blocking or to pay for a bigger server / scale out my infrastructure. Why should I have to serve them?
Right, then they shouldn't be affected by the rate-limiting, as long as it's reasonable. If it were applied evenly to all clients/crawlers, it'd at least allow the possibility for a respectful, well-designed crawler to compete.
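Something like a per-client token bucket is what I have in mind. A rough sketch in Python (the rate, burst size and the idea of keying on client IP are just assumptions for illustration, not anything Cloudflare or anyone else actually ships):

    import time
    from collections import defaultdict

    # Hypothetical numbers: every client gets the same budget,
    # whether it's Googlebot or FooCrawler.
    RATE = 5.0     # tokens refilled per second
    BURST = 20.0   # maximum bucket size

    _buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

    def allow_request(client_ip):
        """Return True if this client may make a request right now."""
        b = _buckets[client_ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
        b["ts"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False  # the caller would answer with HTTP 429

A well-behaved crawler never notices a limit like this; the fork()-a-thousand-requests kind gets throttled without anyone having to guess who they are.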
The problem is, if you own a website, it takes the same amount of resources to handle the crawl from Google and FooCrawler even if both are behaving, but I'm going to get a lot more ROI out of letting Google crawl, so I'm incentivized to block FooCrawler but not Google. In fact, the ROI from Google is so high I'm incentivized to devote extra resources just for them to crawl faster.
Agree... this entire argument seems to assume I give a rat's ass about 9000 different crawlers that give me literally zero benefit and only waste server resources. Most of those crawlers are for ad-soaked, piss-poor search engines. I'd rather just block them all and only allow the crawlers that know how to behave.
In the early 90s there were various nascent systems that were essentially public database interfaces for searching.
The idea was that instead of a centralized search, people could have fat clients that individually query these APIs and then aggregate the results on the client machine.
Essentially every query would be a what/where or what/who pair. This would focus the results.
I really think we need to reboot those core ideas.
We have a manual version today. There are quite a few large databases that the crawlers don't get.
The one-place-for-everything approach has the same fundamental problems that were pointed out 30 years ago; they've just become obvious to everybody now.
I wonder what happens to RSS feeds in this situation. Programs I run that process RSS feeds will just fetch them over HTTP completely headlessly, so if there are any CAPTCHAs, I'm not going to see them.
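For what it's worth, here's roughly what such a headless fetcher ends up looking like. The 403/503-plus-HTML check is just my guess at how challenge pages tend to present themselves, not anything documented, and the feed URL is a placeholder:

    import urllib.request
    import urllib.error

    FEED_URL = "https://example.com/feed.xml"  # hypothetical feed

    def fetch_feed(url):
        """Fetch a feed over plain HTTP; return the body or None."""
        req = urllib.request.Request(url, headers={"User-Agent": "my-rss-reader/1.0"})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            # Challenge pages typically come back as 403/503 with an HTML body
            # instead of XML; headlessly, all I can do is log it and give up.
            body = e.read(2048)
            if e.code in (403, 503) and b"<html" in body.lower():
                print(f"{url}: got {e.code} with HTML, probably an anti-bot challenge")
            else:
                print(f"{url}: HTTP error {e.code}")
            return None

    if __name__ == "__main__":
        data = fetch_feed(FEED_URL)
        if data is not None:
            print(f"fetched {len(data)} bytes of feed")

There's no browser in the loop, so if the response is a CAPTCHA page the feed simply stops updating silently.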
I've found that Cloudflare isn't great at this. I even found cases where my site was failing to load for Googlebot (a "good" bot they presumably have the IPs for) because they were serving a captcha instead of my CSS.
So your best bet is setting a page rule to allow those URLs.
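If you'd rather script it than click through the dashboard, something along these lines against the Page Rules API should do it. The zone ID, token and URL pattern are placeholders, and the "security_level"/"essentially_off" action is from memory, so double-check the exact field names against Cloudflare's API docs:

    import json
    import urllib.request

    ZONE_ID = "your-zone-id"      # placeholder
    API_TOKEN = "your-api-token"  # placeholder

    rule = {
        "targets": [{
            "target": "url",
            "constraint": {"operator": "matches", "value": "example.com/feed*"},
        }],
        # Relax bot checks on the feed URLs so headless readers aren't challenged.
        "actions": [{"id": "security_level", "value": "essentially_off"}],
        "status": "active",
    }

    req = urllib.request.Request(
        f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/pagerules",
        data=json.dumps(rule).encode(),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())
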
In my experience, those either get detected(?) and let through (RSS can be aggressively cached, after all), or you're out of luck: the website owner set up e.g. WordPress (which automatically includes RSS URLs) but did not configure Cloudflare to let RSS through.
That will be interesting to see with regard to legal implications. If they (in the website operator's name) block a normal user's access to e.g. privacy info pages "by accident", that could be a compliance issue.
I don't think mass blocking is the right approach in general. IPs, even residential ones, are relatively easy and relatively cheap to come by. At some point you're blocking too many normal users. Captchas are a strong weapon, but they too carry a significant cost by annoying users. Cloudflare could theoretically do invisible-invisible captchas by never even running any code on the client, but that would amount to wholesale tracking and would probably not fly in the EU.
It's not Cloudflare who is deciding it. It's the website owners who request things like "Super Bot Fight Mode". I never enable such things on my CF properties. Mostly it's people who manage websites with "valuable" content, e.g. shops with prices, who desperately want to stop scraping by competitors.
I can say this will give a lot of businesses a false sense of security. It is already bypassable.
The web scraping technology that I am aware of has reached its endgame already: unless you are prepared to authenticate every user/visitor to your website with a dollar sign, or lobby Congress to pass a bill outlawing web scraping, you will not be able to stop web scraping in 2021 and beyond.
Due to aggressive NoScript and uBlock use, I, browsing the website as a human, keep getting hit by captchas, and my success rate is falling to a coin flip. If there's a script to automate that, I'm all ears.
100% doable. Like I said, this type of blanket throttling seems to be the new trend, but it's already defeated.
I just no longer see it as possible to 1) put information on the web (private or public), 2) give access outside your organization (to customers or visitors), and 3) expect your website will not be scraped.
Or maybe don’t “hate” folks who are just trying to put some content online and don’t want to deal with botnets taking down their work? You know, like what the internet was intended for.
> don’t want to deal with botnets taking down their work
Botnets and automated crawling are completely different things. This isn't about preventing service degradation (even if it gets presented that way). It's an attempt by content publishers to control who accesses their content and how.
Cloudflare is actively assisting their customers to do things I view as unethical. Worse, only Cloudflare (or someone in a similarly central position) is capable of doing those things in the first place.
The internet was certainly not intended for centralization. I hit Cloudflare captchas and error pages so often it's almost sickening. So many things are behind Cloudflare, things you least expect to be behind Cloudflare.
It's easy enough to bypass most Cloudflare “anti-bot” with an unusual refresh pattern or messing with a cookie. (It's easier to script this than solve the CAPTCHAs.)
Anyone malicious who is determined enough will just pay their way through any captcha. Yet for me, as a legitimate user, these "one more step" pages feel downright humiliating. At this point, if I see one, I either just nope out of it, or look for a saved copy on archive.org.