> Let’s take a look at the robots.txt for census.gov from October of 2018 as a specific example to see how robots.txt files typically work. This document is a good example of a common pattern. The first two lines of the file specify that you cannot crawl census.gov unless given explicit permission.
This was eyebrow-raising. Actually, it seems this is not (or no longer?) true. Here is https://census.gov/robots.txt:

User-agent: *
User-agent: W3C-checklink
Disallow: /cgi-bin/
Disallow: /libs/
...

That first line wildcards any user agent but, on its own, does nothing with it. If the file still blocked all unnamed robots, the line after `User-agent: *` would have to be `Disallow: /`. It looks like someone noticed and told the operators, rightly, that government web pages with public information (especially the census) shouldn't have such restrictions, and they then removed only the `Disallow: /` line while leaving the wildcard line in place. Leaving that first line arguably has no impact on the meaning of the file.

Actually, thinking more about this, I suspect the file is misconfigured: they clearly don't want robots touching /cgi-bin/ and the like (reasonable!), but if the wildcard line is inert, they are only asking the named robots to stay out, and all other bots get no guidance about what not to touch.
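Whether the wildcard line is truly inert depends on how a given crawler groups consecutive `User-agent` lines. As a quick check of one common interpretation, here is a small sketch using CPython's stdlib `urllib.robotparser` (the bot name `SomeRandomBot` is made up for illustration). CPython folds consecutive `User-agent` lines into a single group, so under this parser the `Disallow` rules apply to unnamed bots as well; other crawlers may interpret the grouping differently.

```python
from urllib.robotparser import RobotFileParser

# The quoted census.gov rules (the trailing "..." is omitted,
# since it is elision in the quote, not robots.txt syntax).
rules = """\
User-agent: *
User-agent: W3C-checklink
Disallow: /cgi-bin/
Disallow: /libs/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# CPython attaches both User-agent lines to the same rule group,
# so an arbitrary, unnamed bot is blocked from /cgi-bin/ ...
print(rp.can_fetch("SomeRandomBot", "https://census.gov/cgi-bin/query"))  # False
# ... but remains free to crawl paths the file doesn't mention.
print(rp.can_fetch("SomeRandomBot", "https://census.gov/data/"))          # True
```

For what it's worth, RFC 9309 (the Robots Exclusion Protocol) also treats consecutive `User-agent` lines as one group, which would make the wildcard line meaningful rather than inert, though parsers have historically disagreed on this point.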