> Let’s take a look at the robots.txt for census.gov from October of 2018 as a specific example to see how robots.txt files typically work. This document is a good example of a common pattern. The first two lines of the file specify that you cannot crawl census.gov unless given explicit permission.
This was eyebrow-raising. Actually, it seems this is not (or no longer?) true. Here is https://census.gov/robots.txt:

User-agent: *
User-agent: W3C-checklink
Disallow: /cgi-bin/
Disallow: /libs/
...

That first line wildcards any user agent but, on its own, does nothing with it. If the file still blocked all unnamed robots, the line after `User-agent: *` would have to be `Disallow: /`. It looks like someone noticed and told the operators, rightly, that government web pages with public information (especially the census) shouldn't have such restrictions, and they then removed only the `Disallow: /` line while leaving the wildcard line in place. Leaving that first line arguably has no impact on the meaning of the file.

Actually, thinking more about this, I suspect the file is misconfigured: they clearly don't want robots touching /cgi-bin/ and the like (reasonable!), but if the wildcard line is inert, they are only asking the named robots to stay out, and all other bots get no guidance about what not to touch.
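Whether the wildcard line is truly inert depends on how a given crawler groups consecutive `User-agent` lines. As a quick check of one common interpretation, here is a small sketch using CPython's stdlib `urllib.robotparser` (the bot name `SomeRandomBot` is made up for illustration). CPython folds consecutive `User-agent` lines into a single group, so under this parser the `Disallow` rules apply to unnamed bots as well; other crawlers may interpret the grouping differently.

```python
from urllib.robotparser import RobotFileParser

# The quoted census.gov rules (the trailing "..." is omitted,
# since it is elision in the quote, not robots.txt syntax).
rules = """\
User-agent: *
User-agent: W3C-checklink
Disallow: /cgi-bin/
Disallow: /libs/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# CPython attaches both User-agent lines to the same rule group,
# so an arbitrary, unnamed bot is blocked from /cgi-bin/ ...
print(rp.can_fetch("SomeRandomBot", "https://census.gov/cgi-bin/query"))  # False
# ... but remains free to crawl paths the file doesn't mention.
print(rp.can_fetch("SomeRandomBot", "https://census.gov/data/"))          # True
```

For what it's worth, RFC 9309 (the Robots Exclusion Protocol) also treats consecutive `User-agent` lines as one group, which would make the wildcard line meaningful rather than inert, though parsers have historically disagreed on this point.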