Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.
Question: Should a web crawler always reveal its true name?
Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" page - even for requests to robots.txt. Another example is a site that had an explicit whitelist in its robots.txt. Strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. Googlebot, because it isn't clear what these clueless webmasters intended. It appears that some websites are misconfigured, or so Google-optimized, that other/new search engines may have to resort to faking the user-agent.
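As a minimal sketch of the whitelist situation: Python's standard urllib.robotparser can show which user-agent tokens a site's robots.txt actually admits. The site URL here is a placeholder, and note that read() itself fetches robots.txt with Python's default user-agent, which is exactly the kind of client the picky sites reject.

```python
# Sketch: see which user-agent tokens a site's robots.txt admits.
# The URL is hypothetical; can_fetch() matches the token against the
# User-agent groups in the fetched robots.txt.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # caveat: this fetch uses Python's own default user-agent

for agent in ("findxbot", "curl", "wget", "Googlebot", "*"):
    print(agent, rp.can_fetch(agent, "https://example.com/some/page"))
```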
I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which clearly turns up when I search on Qwant) has never been crawled by any user-agent that looks like it could belong to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm
This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.)
So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?
(Score: 0) by Anonymous Coward on Tuesday July 05 2016, @01:08PM
Your stealth crawling is not important to my website.
Over 50k IPs (mostly AWS) blocked for faking the user agent and not respecting robots.txt.
Claim to be Google from AWS -> perma ban. Simple as that, spammer.
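For what it's worth, one common way to tell a claimed Googlebot on an AWS address from the real thing is the reverse-then-forward DNS check that Google itself recommends. A rough sketch (the second IP is just a documentation-range example):

```python
# Rough sketch of the reverse-then-forward DNS check for a claimed Googlebot.
# A genuine Googlebot IP resolves to a *.googlebot.com or *.google.com host,
# and that hostname resolves back to the same IP. An AWS address fails this.
import socket

def is_real_googlebot(ip):
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward confirmation
    except socket.gaierror:
        return False

print(is_real_googlebot("66.249.66.1"))   # address in a known Googlebot range
print(is_real_googlebot("203.0.113.7"))   # documentation-range IP -> False
```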
(Score: 1, Interesting) by Anonymous Coward on Tuesday July 05 2016, @01:53PM
You most likely blocked ~10K legitimate techy users. Good job.
With advice like this floating around nowadays, http://www.ghacks.net/2016/02/26/read-articles-behind-paywalls-by-masquerading-as-googlebot/ [ghacks.net] *grins*
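(The linked article boils down to sending the request with Googlebot's user-agent string and hoping the paywall only checks that header. A hedged sketch with a hypothetical URL; sites that also verify the IP, as above, will not be fooled:)

```python
# Sketch of the trick the linked article describes: spoof Googlebot's
# user-agent and hope the site only checks the header. URL is hypothetical.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
}
resp = requests.get("https://example.com/paywalled-article", headers=headers)
print(resp.status_code, len(resp.text))
```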
(Score: 0) by Anonymous Coward on Tuesday July 05 2016, @02:04PM
Most likely? Do you even know what kind of site they run? With so little information, how have you determined the likelihood of such a thing?
(Score: 0) by Anonymous Coward on Tuesday July 05 2016, @02:26PM
Well, let's see: if it's a website that expects to be indexed by search engines at all, then it's probably public-facing, or else it should at the very least be behind an HTTP login prompt. Being worried about spammers means there's probably some kind of form or other means for submitting things to the site. Both of those cases suggest that random users may stumble upon the site and get perma-banned for pretty much no reason whatsoever, and without warning.
(Score: 0) by Anonymous Coward on Tuesday July 05 2016, @02:29PM
"Users" that request 100 pages per day over and over again, dozens per minute, with alternating UA, never requesting an image, css or javascript. Right.
It is your problem willingly using with bad hosters / IP blocks that do not care about malicious clients like AWS.
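(The pattern described here - many page fetches, rotating user-agents, never any assets - is easy enough to spot in an access log. A crude sketch, assuming the common Apache/Nginx "combined" log format and made-up thresholds:)

```python
# Crude sketch: flag IPs that fetch many pages under several different
# user-agents but never touch images/CSS/JS. Assumes "combined" log format;
# the log path and thresholds are placeholders.
import re
from collections import defaultdict

LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')
ASSET = re.compile(r'\.(png|jpe?g|gif|css|js|ico|svg|woff2?)(\?|$)', re.I)

stats = defaultdict(lambda: {"pages": 0, "assets": 0, "agents": set()})
with open("access.log") as log:               # placeholder path
    for line in log:
        m = LINE.match(line)
        if not m:
            continue
        ip, path, ua = m.groups()
        stats[ip]["assets" if ASSET.search(path) else "pages"] += 1
        stats[ip]["agents"].add(ua)

for ip, s in stats.items():
    if s["pages"] > 100 and s["assets"] == 0 and len(s["agents"]) > 3:
        print("suspect:", ip, s["pages"], "pages,", len(s["agents"]), "user-agents")
```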
(Score: 0) by Anonymous Coward on Tuesday July 05 2016, @02:29PM
And your website is not important to my stealth crawling.
When nobody can find your web site because you have blocked the latest hot new search tool... well, too bad, you don't exist anymore.
(Score: -1, Flamebait) by Anonymous Coward on Tuesday July 05 2016, @02:33PM
Your hot new search tool that needs to lie about itself is not important to anyone.
(Score: 1, Touché) by Anonymous Coward on Wednesday July 06 2016, @04:26AM
I'm sorry, I can't hear you. You seem to be busy not existing. :P