Google's main business has been search, and now it wants to make a core part of it an internet standard.
The internet giant has outlined plans to turn the Robots Exclusion Protocol (REP) — better known as robots.txt — into an internet standard after 25 years. To that end, it has also made the C++ robots.txt parser that underpins the Googlebot web crawler available on GitHub for anyone to access.
"We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers," Google said. "Together with the original author of the protocol, webmasters, and other search engines, we've documented how the REP is used on the modern web, and submitted it to the IETF."
The REP is one of the cornerstones of web search engines, and it helps website owners manage their server resources more easily. Web crawlers — like Googlebot — are how Google and other search engines routinely scan the internet to discover new web pages and add them to their list of known pages.
A follow-up post on Google's blog expands on the proposal.
The draft specification is available here, and Google has put its open-source repository up on GitHub.
(Score: 2, Interesting) by Anonymous Coward on Thursday July 04 2019, @06:18AM (6 children)
robots.txt discriminates by user agent, and many naive websites outright disallow the wildcard user agent (i.e., any crawler they don't recognize) while whitelisting established search engines. This effectively blocks any new search engines from competing. Their only recourse is to ignore robots.txt (or at least to impersonate Google while parsing it).
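To make that concrete, here's a minimal sketch (using Python's standard urllib.robotparser; the crawler names and example.com site are hypothetical) of the kind of whitelist-only robots.txt described above, which shuts out any crawler it doesn't already know:

```
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind described above: shut out everyone
# by default, then whitelist an established search engine's crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# An unknown newcomer falls under the wildcard rule and is locked out...
print(robots.can_fetch("ShinyNewSearchBot/0.1", "https://example.com/page"))  # False
# ...while the whitelisted crawler may fetch anything.
print(robots.can_fetch("Googlebot", "https://example.com/page"))              # True
```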
(Score: 2) by MostCynical on Thursday July 04 2019, @07:25AM (1 child)
Why wouldn't they impersonate Bing?
"I guess once you start doubting, there's no end to it." -Batou, Ghost in the Shell: Stand Alone Complex
(Score: 2) by kazzie on Thursday July 04 2019, @09:12PM
Because nobody wants to be mistaken for a pre-school bunny [wikipedia.org].
(Score: 2) by bradley13 on Thursday July 04 2019, @09:27AM (3 children)
"robots.txt discriminates by useragent, and many naive websites outright disallow any unrecognized wildcard useragents, while whitelisting established search engines."
So? Why is this a problem? If I don't want Joe Random crawling my site, then I tell him so. He is completely free to ignore my wishes.
AFAIK, robots.txt serves two purposes. First, it tells specific crawlers that they are unwelcome. For example, if you do not want your site archived, this is how you inform archive.org of your preference. Second, it allows you to inform crawlers about content that is irrelevant or useless to them, thus saving both them and your server unnecessary effort. However, it remains an honor system and may be freely disregarded, and there are even perfectly legitimate reasons for doing so.
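For illustration, a minimal sketch of a robots.txt covering both purposes, checked with Python's standard urllib.robotparser (ia_archiver is the user agent traditionally associated with archive.org's crawler; the paths and URLs are made up):

```
from urllib.robotparser import RobotFileParser

# Purpose one: tell a specific crawler it is unwelcome (here, the archiver).
# Purpose two: steer everyone else away from content that is useless to them.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())
print(robots.can_fetch("ia_archiver", "https://example.com/article"))   # False
print(robots.can_fetch("SomeCrawler", "https://example.com/article"))   # True
print(robots.can_fetch("SomeCrawler", "https://example.com/tmp/junk"))  # False
```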
Everyone is somebody else's weirdo.
(Score: 2, Interesting) by Anonymous Coward on Thursday July 04 2019, @10:56AM (2 children)
Three, if you want to count flagging 'interesting' parts of your site for the attention of 'miscreants'...
I've just enabled four virtual hosts on a server; within hours, a number of IP addresses attached to DSL lines dotted around the globe were attempting to grab the (non-existent) robots.txt files from these virtual sites. The same IP addresses also tried various PHP exploits, MySQL exploits, etc.
The file has its uses, but seriously? Google, of all the momsers, attempting to mandate its use seems a mite bloody strange...
(Score: 1, Interesting) by Anonymous Coward on Thursday July 04 2019, @02:12PM (1 child)
If you're actually using robots.txt to do that, then you're doing it wrong and deserve everything you get. If you think the absence of a robots.txt file is going to protect you, then you are sadly mistaken. Security by obscurity is never a good policy. The robots.txt file is there as a guideline for good actors, but no one has to absolutely respect it, not even all good actors (archive.org for one ignores robots.txt). Maybe these evil robots are looking for a default robots.txt file that's put there by some vulnerable package.
The mandate, then, is largely for people who want to make web robots that try to be good neighbours, like Google's crawler tries to be, and for people who, for whatever reason, don't want portions of their site crawled by these well-behaved robots. Most RFCs, except for those in the Standards Track, aren't really mandates; a lot are just codifications of established practice. This one looks like the latter type.
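As a rough sketch of what being a good neighbour amounts to in practice (Python's standard urllib.robotparser again; the crawler name and target URL are hypothetical), a polite crawler fetches /robots.txt first, only requests what the file permits, and declares an honest user agent:

```
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "PoliteBot/0.1"  # hypothetical crawler name

def polite_fetch(url):
    """Fetch url only if the site's robots.txt permits it for our user agent."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    robots = RobotFileParser(urljoin(root, "/robots.txt"))
    robots.read()  # a missing robots.txt is treated as "allow everything"
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the site asked us not to; a good neighbour complies
    request = Request(url, headers={"User-Agent": USER_AGENT})  # honest user agent
    with urlopen(request) as response:
        return response.read()

# Hypothetical usage:
# page = polite_fetch("https://example.com/some/page")
```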
(Score: 0) by Anonymous Coward on Friday July 05 2019, @03:20AM
People put all sorts of places in their robots.txt because they don't want robots firing off scripts whose results the robots won't care about or whose forms they can't fill out properly. For example, SoylentNews has Disallow: /search.pl in its robots.txt because hitting that page causes a script to process the request, at a minimum, and potentially hit the database.
However, as mentioned, that shows blackhats that there is definitely some script or something there that you don't want good crawlers to see.
But if you really are worried about something like that, you can always put Disallow: /admin-control-panel.pl in your robots.txt too, except that URL points to a script that adds the requester's IP address to your firewall's blacklist (think fail2ban, denyhosts, OSSEC, or stockade).
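A minimal sketch of that honeypot idea (the decoy path comes from the comment above; the log file name is made up, and a real deployment would hand the log to fail2ban or similar rather than serve production traffic from Python's http.server):

```
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PATH = "/admin-control-panel.pl"   # decoy path, Disallowed in robots.txt
BAN_LOG = "/var/log/robots-trap.log"    # hypothetical log watched by fail2ban etc.

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == TRAP_PATH:
            # Anything requesting the decoy has almost certainly mined
            # robots.txt for targets; record its IP for the firewall tooling.
            with open(BAN_LOG, "a") as log:
                log.write(f"trap hit from {self.client_address[0]}\n")
            self.send_error(403)
            return
        # Placeholder response for every other path.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"hello\n")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), TrapHandler).serve_forever()
```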
(Score: 2) by c0lo on Thursday July 04 2019, @07:36AM (1 child)
... and we might see internet security become a solved problem in our lifetime. Just one more standard [ietf.org].
point: respecting robots.txt, honest declaration of your user agent... If you can enforce them, what's the flip of a single bit?
https://www.youtube.com/watch?v=aoFiw2jMy-0
(Score: 1, Touché) by Anonymous Coward on Thursday July 04 2019, @02:47PM
(Score: 3, Insightful) by FatPhil on Thursday July 04 2019, @07:39AM (1 child)
They say it needs standardisation their way because some people have created ambiguously interpretable robots.txt files. Remind me how approval for their specification will stop website owners from creating ambiguously interpretable robots.txt files?
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 1, Interesting) by Anonymous Coward on Thursday July 04 2019, @09:59AM
Yes [github.com]
An official rather than de facto standard is a much stronger legal defence against legislative extortion attempts from media outlets.
(Score: 3, Funny) by Bot on Thursday July 04 2019, @09:14AM (1 child)
>robot exclusion protocol
I'll sue.
(and if you think a bot suing is a funny thing, wait till the powerful ones, after corporate citizenship, give bots legal protection, so we can serve them better).
Account abandoned.
(Score: -1, Troll) by Anonymous Coward on Thursday July 04 2019, @10:02AM
Wait until they standardize the white male exclusion protocol HR employ under the guise of diversity and inclusivity.
(Score: 3, Interesting) by kazzie on Thursday July 04 2019, @09:15PM
Okay, so Google has just proposed this.
Their altruistic days are arguably behind them, so that leaves me thinking: how do they benefit from this standardisation?