Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.
Question: Should a web crawler always reveal its true name?
Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a website that checks the user-agent in the HTTP request and returns "your browser is not supported", even for robots.txt. Another example is a site that had an explicit whitelist in robots.txt. Strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. googlebot, because it isn't clear what the clueless webmasters' intentions are. It appears that some websites are misconfigured or so Google-optimized that other/new search engines may have to resort to faking the user-agent.
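To make the whitelist case concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the URL, and the crawler name 'examplebot' are assumptions for illustration, not the actual file from the site mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with an explicit whitelist: 'curl' may fetch
# everything, every other user-agent is disallowed everywhere.
ROBOTS_TXT = """\
User-agent: curl
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str = "https://example.com/page.html") -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("curl", "wget", "examplebot"):
        print(agent, allowed(agent))
```

Under such a file a crawler that honours robots.txt and announces its own name is locked out, while anything calling itself 'curl' is let in. That is the situation that makes faking the user-agent tempting.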
I'm also puzzled by Qwant because they claim to have their own search index, but my personal website (which is clearly indexed when I search in Qwant) has never been crawled by a user-agent resembling anything that could lead to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm
This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.).
So what do you Soylentils say, is faking the user-agent in webcrawls necessary? Acceptable? A necessary evil?
(Score: 1, Insightful) by stormreaver on Tuesday July 05 2016, @12:28PM
Using a fake user agent just encourages bad website configuration. If your search engine becomes popular enough to matter, then not indexing badly configured sites will encourage those sites' administrators to correct their errors. Otherwise, your site doesn't matter anyway.
If I want my site to be available only to Google, and you're pretending to be Google, then you're committing multiple felonies in violation of the Computer Fraud and Abuse Act. You know, the one that tends to be so heavily abused that it caused at least one prominent developer to kill himself rather than face its consequences. I may not matter enough for the TLAs to chase you, but you will eventually piss off someone who does.
And finally: Don't be a dick. Use your software's actual user agent. If you're rejected by a Web site (for whatever reason), just move on.
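A minimal sketch of that policy, using Python's standard urllib. The crawler name, the info URL, and the helper names build_request and fetch_or_skip are hypothetical placeholders, not any real crawler's code.

```python
import urllib.error
import urllib.request

# Hypothetical crawler identity; the name and info URL are placeholders.
USER_AGENT = "examplebot/1.0 (+https://example.com/bot.html)"


def build_request(url: str) -> urllib.request.Request:
    """Build a request that identifies the crawler by its real name."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_or_skip(url: str, timeout: float = 10.0):
    """Return the response body, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError:
        # Rejected (e.g. 403): move on instead of retrying under another name.
        return None
```

Network failures (urllib.error.URLError) are deliberately not caught here; a real crawler would presumably handle retries and back-off separately.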
(Score: 5, Informative) by coolgopher on Tuesday July 05 2016, @03:55PM
Sorry, but those claims are absolute bulldust. A user-agent header does not constitute "pretending to be Google" and sure as fuck doesn't violate laws. If you actually Read The F(ine|ucking) RFC (RFC 7231, §5.5.3) you'll see that user-agent masquerading is explicitly mentioned and accepted if a client wishes to receive "responses tailored for that identified user agent".
(Score: 2) by stormreaver on Wednesday July 06 2016, @12:41AM
Your RFC argument is weak, at best (though not entirely outrageous). That being said, technical definitions are entirely irrelevant to legal proceedings. You clearly haven't seen the successful abuses perpetrated under the CFAA. Any such prosecution under the CFAA will likely include user-agent spoofing as falsifying your identity to a large enough corporation, which is illegal under that law.
Also, read the article. The author is spoofing the user agent to get around explicit blockades put into place by the Web site owner (even if those blockades weren't explicitly meant for the author). Again, illegal hacking under the CFAA.
(Score: 1) by isj on Wednesday July 06 2016, @12:58PM
"Also, read the article. The author is spoofing the user agent"
I'd like to make it very clear that our crawler doesn't spoof the user-agent string.
I was curious if we are being naive by not spoofing it.
If by "article" you refer to the links I provided then yes there are indications that some crawlers are doing something fishy, or that their search index is actually provided by a 3rd party.
(Score: 1) by toddestan on Thursday July 07 2016, @02:57AM
Keep in mind that every major browser has been spoofing its user agent pretending to be Netscape for YEARS. The website owners may not like it and could ban you (which would be well within their rights), but I wouldn't worry too much about getting dragged into federal court over it.
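For illustration, a small sketch that lists the product tokens in user-agent strings of the shape modern browsers send. The sample strings and version numbers are examples only, and product_tokens is a hypothetical helper. "Mozilla" was the token Netscape's browser used, which is why it still leads every string.

```python
import re

# Illustrative user-agent strings; exact version numbers are examples only.
SAMPLES = {
    "Chrome": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    ),
    "Firefox": "Mozilla/5.0 (X11; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0",
}


def product_tokens(user_agent: str) -> list:
    """Return product names in order, ignoring parenthesised comments."""
    without_comments = re.sub(r"\([^)]*\)", "", user_agent)
    return [token.split("/")[0] for token in without_comments.split()]


if __name__ == "__main__":
    for browser, ua in SAMPLES.items():
        print(browser, product_tokens(ua))
```

In both samples the browser's own name only shows up after the tokens of the products it claims compatibility with.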