Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.
Question: Should a web crawler always reveal its true name?
Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" page - even for robots.txt. Another example is a site that had an explicit whitelist in its robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. Googlebot, because it isn't clear what the (possibly clueless) webmaster's intentions are. It appears that some websites are so misconfigured or so Google-optimized that other or new search engines may have to resort to faking the user-agent.
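To make the robots.txt probing concrete, here is a rough Python sketch that fetches a site's robots.txt under a few different user-agent strings and asks whether each would be allowed to crawl the front page. The site and the agent names (including 'findxbot') are placeholders for illustration, not our actual crawler:

    # Sketch: probe how a site's robots.txt treats different user-agents.
    # The site and agent strings are hypothetical placeholders.
    import urllib.error
    import urllib.request
    import urllib.robotparser

    SITE = "https://example.com"
    AGENTS = ["findxbot", "curl/7.50.1", "Wget/1.18", "Googlebot"]

    for agent in AGENTS:
        # Fetch robots.txt ourselves so we control the User-Agent header;
        # some servers answer differently (or with an error page) per agent.
        req = urllib.request.Request(SITE + "/robots.txt",
                                     headers={"User-Agent": agent})
        try:
            body = urllib.request.urlopen(req, timeout=10).read()
        except urllib.error.HTTPError as e:
            print("%s: robots.txt returned HTTP %d" % (agent, e.code))
            continue
        rp = urllib.robotparser.RobotFileParser()
        rp.parse(body.decode("utf-8", "replace").splitlines())
        print("%s: may fetch / -> %s" % (agent, rp.can_fetch(agent, SITE + "/")))

A site that serves a "your browser is not supported" page in place of robots.txt shows up here either as an HTTP error or as HTML parsed as an empty ruleset - exactly the misconfiguration described above.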
I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which clearly is indexed when I search on Qwant) has never been crawled by a user-agent resembling anything that could be traced back to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm
This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex; Startpage uses Google; etc.).
So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?
(Score: 5, Insightful) by GungnirSniper on Tuesday July 05 2016, @01:01PM
So what does he do when using a blank User Agent gets endless loops, and non-standard UAs get banned or blocked?
Aren't you missing out on potential readers when redirecting traffic to a honeypot and reinforcing Google's absolute dominance in the search space?
(Score: 0) by Anonymous Coward on Tuesday July 05 2016, @01:14PM
Ignore the misconfigured website. It is not important to your insignificant crawler.
And the other way around: your insignificant spam crawler that fakes its UA is not important to my insignificant website either.
(Score: 1, Interesting) by Anonymous Coward on Tuesday July 05 2016, @01:32PM
> So what does he do when using a blank User Agent gets endless loops, and non-standard UAs get banned or blocked?
Crawl the site multiple times with different UAs. Don't do it back-to-back, or at least not from the same IP address.
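For what it's worth, a rough Python sketch of that approach (the agent strings, URL, and delay are made-up placeholders; rotating the source IP address is left out):

    # Sketch: re-fetch the same page under different user-agents, spaced out.
    # Agent strings, the URL, and the delay are made-up placeholders.
    import random
    import time
    import urllib.request

    URL = "https://example.com/"
    AGENTS = ["findxbot/1.0", "curl/7.50.1",
              "Mozilla/5.0 (compatible; SomeBot/1.0)"]

    for agent in AGENTS:
        req = urllib.request.Request(URL, headers={"User-Agent": agent})
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(agent, "->", resp.status, len(resp.read()), "bytes")
        # Space the fetches well apart; back-to-back requests are exactly
        # what the advice above warns against.
        time.sleep(random.uniform(3600, 7200))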
(Score: 2) by datapharmer on Thursday July 07 2016, @02:54PM
There is no penalty for not sharing, only for lying.
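In HTTP terms, not sharing just means omitting the header. A rough sketch using Python's http.client, which, unlike most high-level clients, does not add a User-Agent of its own (the host is a placeholder):

    # Sketch: send a request with no User-Agent header at all.
    # http.client adds Host (and Accept-Encoding) for us, but no User-Agent.
    import http.client

    conn = http.client.HTTPSConnection("example.com", timeout=10)
    conn.request("GET", "/")  # no headers passed, so no User-Agent is sent
    resp = conn.getresponse()
    print(resp.status, resp.reason)
    conn.close()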