Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.
Question: Should a web crawler always reveal its true name?
Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a website that checks the user-agent in the HTTP request and returns "your browser is not supported", even for requests to robots.txt. Another example is a site that had an explicit whitelist in robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. Googlebot, because it isn't clear what these clueless webmasters intend. It appears that some websites are so misconfigured or so Google-optimized that other or newer search engines may have to resort to faking the user-agent.
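As an illustration, here is a minimal sketch (Python, standard library only) of how such a whitelist plays out; the robots.txt content is hypothetical, not copied from the actual site:

    from urllib.robotparser import RobotFileParser

    # Hypothetical whitelist-style robots.txt: only 'curl' is allowed;
    # every other agent falls through to the catch-all "Disallow: /".
    ROBOTS_TXT = """\
    User-agent: curl
    Disallow:

    User-agent: *
    Disallow: /
    """

    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())

    for agent in ("curl", "wget", "findxbot"):
        print(agent, parser.can_fetch(agent, "https://example.com/page"))
    # curl True, wget False, findxbot False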
I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which clearly shows up when I search on Qwant) has never been crawled by a user-agent resembling anything that could lead back to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm
This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.)
So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?
(Score: 4, Insightful) by gman003 on Tuesday July 05 2016, @01:56PM
I can see arguments against this, but I'd go with "findxbot/1.0 (like Googlebot)". This is an accepted pattern for browser user-agent strings (most WebKit browsers include "KHTML, like Gecko" in their layout-engine descriptor, for example). It passes simple string checks, but still allows your bot to be specifically excluded if desired: someone wanting to distinguish you from the actual Googlebot just needs to check for "findxbot".
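As a concrete illustration, here are two hypothetical server-side checks (sketched in Python) showing how the combined string behaves under plain substring matching:

    # The UA string is the proposal above; both check functions are
    # hypothetical examples of naive substring matching on User-Agent.
    UA = "findxbot/1.0 (like Googlebot)"

    def passes_naive_googlebot_check(user_agent: str) -> bool:
        # A site that whitelists Googlebot with a bare substring test.
        return "Googlebot" in user_agent

    def is_findxbot(user_agent: str) -> bool:
        # A site that wants to single the bot out can still match the
        # specific product token.
        return "findxbot" in user_agent.lower()

    print(passes_naive_googlebot_check(UA))  # True
    print(is_findxbot(UA))                   # True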
The argument against this, of course, is that some people have decided to auto-blacklist anyone coming from a non-Google IP with a Googlebot agent. If they're doing it by just checking for the string "Googlebot", you'll get blacklisted. But I would be willing to call that behavior wrong, and live with occasional blacklisting (I do not expect this to be common behavior).
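For reference, what Google itself documents for verifying Googlebot is a reverse-DNS lookup with forward confirmation, rather than a string match. A rough sketch (Python, stdlib only, minimal error handling):

    import socket

    def is_real_googlebot(ip: str) -> bool:
        # Reverse lookup: a genuine Googlebot IP resolves to a host
        # under googlebot.com or google.com ...
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # ... and that host must resolve back to the same IP.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False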
(Score: 1) by isj on Tuesday July 05 2016, @03:02PM
We currently use the user-agent string "Mozilla/5.0 (compatible; Findxbot/1.0; +http://www.findxbot.com)", and the entry we check in robots.txt is for "findxbot".
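A quick sketch of why we check the short token: robots.txt matchers compare the group name against the bot's product token, not the full header (shown here with Python's stdlib parser; the robots.txt content is made up):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse("""\
    User-agent: findxbot
    Disallow: /private/
    """.splitlines())

    print(rp.can_fetch("findxbot", "https://example.com/private/x"))  # False
    print(rp.can_fetch("findxbot", "https://example.com/public/x"))   # True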
It would be risky to mention any other bot's name in our user-agent, because if that bot misbehaves we would get hit by the angry webmasters too.
(Score: 2) by butthurt on Tuesday July 05 2016, @06:46PM
That behaviour is fairly common, I think. I encountered it when I used the Googlebot user-agent for interactive browsing.