Are there any actual repercussions for just ignoring robots.txt?

asciident · on March 26, 2021

There are if you are doing it for work. For example, your company could get sued if you are found using that data while ignoring the ToS. If you are a public figure, your name could be tarnished for doing something unethical, or the media may call it "hacking". If you are re-releasing the data, you risk getting a takedown notice.

eertami · on March 27, 2021

robots.txt is not a terms of service. Even if it were, it wouldn't be enforceable for a public website. The site owner would need to prove that your web crawler is maliciously causing disruption to their service, and that is not easy.

asciident · on March 27, 2021

All it takes is for your company's execs or lawyers to be afraid of a stern letter and ask you to cancel your project. If you're violating their robots.txt, you're probably also violating their terms of service, which are hidden somewhere. And your company doesn't want to risk having to pay hundreds of thousands to fight a court case. There are also venues besides courts for them to attack you, like contacting the publishers or hosting platforms for your derivative works. It's a chilling effect.

And I'm not making this up. This kind of stuff has happened to me many times.

5560675260 · on March 26, 2021

Your crawler's IP might get banned, eventually.

the_dege · on March 26, 2021

Sometimes website admins will also try to report your IPs to your service provider as a source of attacks (even if that's not true).

DocTomoe · on March 26, 2021

Given how often I had misbehaving crawlers slow down servers in the early 2000s, I do not see how a crawler that disobeys robots.txt is not an attempted attack.
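
For anyone who would rather stay on the polite side of this, here is a minimal sketch of checking robots.txt before fetching, using Python's standard `urllib.robotparser`. The robots.txt content, the `example-crawler` user agent, the example.com URLs, and the `build_parser` and `may_fetch` helpers are all made up for illustration. A real crawler would load the site's actual file (for example with `set_url()` and `read()`) and sleep between requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user agent string for this sketch.
USER_AGENT = "example-crawler"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules allow USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(ROBOTS_TXT)

# Paths under /private/ are disallowed for every user agent in this example.
allowed_public = may_fetch(parser, "https://example.com/public/page")
allowed_private = may_fetch(parser, "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

Honoring the crawl delay, and not only the disallow rules, is what addresses the server-load complaint above.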