Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.
Question: Should a web crawler always reveal its true name?
Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" page - even for requests to robots.txt. Another example is a site with an explicit whitelist in its robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent such as Googlebot, because it isn't clear what the clueless webmasters' intentions are. It appears that some websites are so misconfigured or so Google-optimized that other/new search engines may have to resort to faking the user-agent.
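For illustration, here is a minimal sketch (Python standard library only; the robots.txt contents and crawler names are made up, not taken from any real site) of how such a whitelist plays out for different user-agents:

```python
# Sketch of a whitelist-style robots.txt: one named agent is allowed
# everywhere, everyone else falls through to the catch-all block.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: curl
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# 'curl' matches the whitelisted entry; any other crawler name hits the
# catch-all rule and is blocked from the whole site.
for agent in ("curl", "wget", "findxbot"):
    print(agent, parser.can_fetch(agent, "https://example.com/page.html"))
```

Any crawler that announces itself honestly under a new name ends up in the "blocked from everything" bucket, which is exactly the position a new search engine finds itself in.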
I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which clearly shows up when I search on Qwant) has never been crawled by a user-agent resembling anything that could be traced back to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm
This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.).
So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?
(Score: 1, Interesting) by Anonymous Coward on Tuesday July 05 2016, @08:20PM
My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent [string]
A decade ago, I hung out in the SeaMonkey newsgroup regularly.
Multiple times a week, we'd see folks reporting that SeaMonkey's (perfectly legit) UA string was getting rejected.
(Idiot web devs sniffing for "Firefox" instead of "Gecko".)
In more recent versions, the SeaMonkey devs have simply surrendered to the idiots and have identified SeaMonkey as Firefox by default.
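A toy illustration of the difference (the UA strings below are representative of SeaMonkey's format, not exact strings from specific releases, and the helper names are made up for this sketch):

```python
# Naive brand sniffing vs. checking the rendering engine.
old_style = "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 SeaMonkey/2.42"
new_style = "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0 SeaMonkey/2.49"

def naive_sniff(ua):
    # what the "idiot web devs" do: look for the browser brand name
    return "Firefox" in ua

def engine_sniff(ua):
    # what the comment argues for: look for the rendering engine
    return "Gecko" in ua

for ua in (old_style, new_style):
    print(naive_sniff(ua), engine_sniff(ua))
# naive_sniff rejects the honest old-style UA; engine_sniff accepts both.
```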
zero reason
The phase when the "standard" was perceived to be "Internet Exploder" got things really screwed up.
If the idiots were going to sniff, the **logical** thing to do would have been to make a page that passed the HTML Validator and deliver **that**.
...UNLESS the least-conforming browser (IE) was detected, whereupon a page with a bunch of hacks would have been delivered (a *different* page for each *version* of that not-even-backwards-compatible browser).
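As a hedged sketch of that dispatch logic (the page names and the UA pattern here are made up for illustration, not anyone's actual setup):

```python
# Serve the validated page by default; only hand out a version-specific
# workaround page when the UA actually identifies as old IE ("MSIE").
import re

def pick_page(user_agent: str) -> str:
    m = re.search(r"MSIE (\d+)", user_agent)
    if m:
        # one hacks page per IE version, since the versions weren't
        # even backwards-compatible with each other
        return f"hacks_for_ie{m.group(1)}.html"
    return "validated.html"  # the page that passes the HTML Validator

print(pick_page("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
print(pick_page("Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 SeaMonkey/2.40"))
```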
have contacted web masters
In short, they were mostly sniffing when they didn't need to, then doing the wrong thing with the information they got.
Again, idiots who don't understand their job.
In summary, there are a great many people who are putting up websites who have no clue what they're doing and THEY DON'T CARE.
I see the way to deal with this in the same light as working around a bad boss:
Just tell them what they want to hear and get on with your life.
-- OriginalOwner_ [soylentnews.org]