Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.
Question: Should a web crawler always reveal its true name?
Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" error - even for robots.txt. Another example is a site that had an explicit whitelist in robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. Googlebot's, because it isn't clear what the clueless webmaster's intentions are. It appears that some websites are so misconfigured or so Google-optimized that other/new search engines may have to resort to faking the user-agent.
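To illustrate the whitelist problem, here is a minimal sketch using Python's standard urllib.robotparser, with a hypothetical whitelist-style robots.txt and a made-up bot name. Everything hinges on the name a crawler announces:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with an explicit whitelist: only 'curl' may crawl,
# everything else is disallowed - similar to the situation described above.
ROBOTS_TXT = """\
User-agent: curl
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The parser matches on the user-agent token, so what a crawler calls itself
# decides what it is allowed to fetch.
for ua in ("curl/7.88.1", "Wget/1.21", "MyNewSearchBot/0.1"):
    print(ua, "->", rp.can_fetch(ua, "https://example.com/some/page"))
# Only the curl UA gets True; the others are shut out purely by name.
```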
I'm also puzzled by Qwant, because they claim to have their own search index, yet my personal website (which clearly shows up when I search in Qwant) has never been crawled by a user-agent resembling anything that could lead back to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm
This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.)
So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?
(Score: 4, Interesting) by datapharmer on Tuesday July 05 2016, @11:27AM
As someone who has banned search bots for being disrespectful, I suggest the following:
-Always, ALWAYS follow robots.txt (I honeypot the disallowed admin URL you see listed there, so requesting it is your fastest way to get banned forever; there's a sketch of this after the list)
-List the user agents and IP addresses I can expect on your own website, and make sure that information can be found through OTHER major search engines
-Apply for whitelisting of your IP ranges through all the services you can find, and leave notes at places like Project Honey Pot saying that you are a search engine and where to find out more
-A browser UA is fine, but don't use another bot's UA.
-Follow crawl rates, and if the site starts to get laggy, back off and come back later!
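For what it's worth, here's a minimal sketch of the robots.txt honeypot mentioned above, assuming Flask and a made-up trap path (/not-really-admin/); a real setup would persist the ban list and probably push it to a firewall:

```python
# Minimal sketch of a robots.txt honeypot. The trap path is listed as Disallow
# in robots.txt, so only crawlers that ignore robots.txt ever request it --
# and those get banned.
from flask import Flask, Response, abort, request

app = Flask(__name__)
BANNED_IPS = set()  # in production this would be persisted / fed to the firewall

ROBOTS_TXT = """\
User-agent: *
Disallow: /not-really-admin/
"""

@app.route("/robots.txt")
def robots():
    return Response(ROBOTS_TXT, mimetype="text/plain")

@app.before_request
def block_banned():
    if request.remote_addr in BANNED_IPS:
        abort(403)

@app.route("/not-really-admin/")
def honeypot():
    # Anyone fetching a URL explicitly disallowed in robots.txt is treated
    # as a misbehaving bot and banned for good.
    BANNED_IPS.add(request.remote_addr)
    abort(403)

@app.route("/")
def index():
    return "Hello, humans."
```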
The reason for these "rules" is that if I see a great deal of automated traffic from an unknown crawler, my first instinct is that it is malicious, because more often than not it is. Also, if I see something claiming to be Bing, Google, Yandex etc. that doesn't reverse-lookup to their IP ranges, I assume it is trying to fool me for some reason that, more often than not, is not out there to help me. My website is ultimately for humans, so bots of any kind are treated as second-class citizens and eyed with suspicion.
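The reverse-lookup check is roughly this - a sketch in Python, where the hostname suffixes are an assumption based on what each engine documents for its crawlers, so verify them against current docs before relying on it:

```python
import socket

# Hostname suffixes the major engines publish for their crawlers
# (assumption - check each engine's current documentation).
KNOWN_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "bingbot":   (".search.msn.com",),
    "YandexBot": (".yandex.ru", ".yandex.net", ".yandex.com"),
}

def verify_crawler(ip, claimed_bot):
    """Reverse-resolve the IP, check the suffix, then forward-resolve to confirm."""
    suffixes = KNOWN_SUFFIXES.get(claimed_bot)
    if not suffixes:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)       # reverse DNS
    except socket.herror:
        return False
    if not hostname.endswith(suffixes):
        return False
    try:
        return socket.gethostbyname(hostname) == ip     # forward-confirm
    except socket.gaierror:
        return False

# Example (placeholder IP): a request whose UA claims to be Googlebot
# print(verify_crawler("203.0.113.7", "Googlebot"))
```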
All that said, use a browser UA if you need to get the page to render. Heck, use a couple in case the content is different depending on the UA - one might be easier for your bot to index.
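Something like this rough sketch (the URL and user-agent strings are placeholders) shows whether a page serves different content to different UAs:

```python
import hashlib
import urllib.request

# Fetch the same page with a couple of different user-agents and compare
# content hashes. URL and UA strings below are placeholders.
URL = "https://example.com/"
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "MyNewSearchBot/0.1 (+https://example.com/bot-info)",
]

for ua in USER_AGENTS:
    req = urllib.request.Request(URL, headers={"User-Agent": ua})
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read()
    digest = hashlib.sha256(body).hexdigest()[:12]
    print(f"{ua[:30]:<32} {len(body):>8} bytes  sha256={digest}")
# Different sizes/hashes suggest the site varies its content by UA, so a
# browser UA may capture what human visitors actually see.
```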
(Score: 1) by isj on Tuesday July 05 2016, @04:20PM
I honeypot the disallowed admin URL you see listed there, so requesting it is your fastest way to get banned forever
I have done the same thing since 2003, but I have never seen any requests for that non-existent URL listed in robots.txt. Is your honeypot on a well-known path, e.g. /wpadmin or similar?
(Score: 2) by datapharmer on Thursday July 07 2016, @02:52PM
I've got both: one in robots.txt and some common URLs (phpmyadmin etc.). The common admin-URL honeypots fill up pretty regularly; the robots.txt disallow I've only seen hit a handful of times across maybe 2 or 3 of my sites.