Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.
Question: Should a web crawler always reveal its true name?
Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" page, even for robots.txt. Another example is a site that had an explicit whitelist in robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. Googlebot, because it isn't clear what the clueless webmasters' intentions are. It appears that some websites are misconfigured or so Google-optimized that other/new search engines may have to resort to faking the user-agent.
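For concreteness, here is a minimal sketch (Python 3 standard library only) of how a crawler can check what a robots.txt whitelist actually allows for different user-agent tokens. The site URL and the agent names, including the "findxbot" token, are placeholders, not real values from the situation described above:

```python
# Minimal sketch, Python 3 stdlib only. The site URL and the agent tokens
# (including "findxbot") are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # note: this fetch itself goes out with Python's default urllib user-agent

for agent in ("findxbot", "curl", "wget", "Googlebot"):
    ok = rp.can_fetch(agent, "https://example.com/some/page.html")
    print(agent, "allowed" if ok else "disallowed")
```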
I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which clearly shows up when I search on Qwant) has never been crawled by a user-agent resembling anything that could be traced back to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm
This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.)
So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?
(Score: 3, Insightful) by SomeGuy on Tuesday July 05 2016, @01:59PM
This. My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent. When I have contacted webmasters about this, all I get is some garbage like "der looks like malware herp derp, use chrome/firefox because security derrrr." I find that completely incredible, because there is zero reason why a malicious web client would not spoof the more popular user agents.
Also, thanks to all the dependence on scripting these days, it seems like the only way to index a page is to load it in each of the big three web browsers (Firefox, Chrome, IE; I still miss Opera) with a "standard" configuration (no ad blockers, scripting enabled, anus wide open ready for insertion) and then OCR it. Might as well go back to using Flash.
(Score: 1, Insightful) by Anonymous Coward on Tuesday July 05 2016, @03:45PM
This. My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent.
Ah, yes. I use Pale Moon with the Random Agent Spoofer add-on (it's for FF, but it also works just fine in PM). The spoofer randomizes my user agent whenever I restart the browser or click on the icon, to one of, I dunno, probably several hundred different OS/browser combinations, half of which I've never heard of (what the hell is Arora? OmniWeb? Uzbl?).
I'll occasionally get warnings about an unsupported or out-of-date browser, or strange layout problems (Google Maps and Image Search are terrible at this), but manually setting the user agent to some recent version of FF or Chrome invariably fixes the problem.
What I'm trying to say is, if sites give you crap about your user agent, just fake it and pretend to be a recent version of Firefox or something.
Bonus: it also helps against tracking, if you don't like websites following your every move. With an "oddball web browser" you're quite easy prey; it's better to either change it periodically or settle on something a lot of people use...
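A minimal sketch of that workaround, using only the Python 3 standard library; the URL is a placeholder and the UA string is just an example of a mainstream Firefox identifier from around that time:

```python
# Minimal sketch, Python 3 stdlib only. The URL is a placeholder and the UA
# string is just an example of a mainstream Firefox identifier.
import urllib.request

SPOOFED_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) "
              "Gecko/20100101 Firefox/47.0")

req = urllib.request.Request("https://example.com/",
                             headers={"User-Agent": SPOOFED_UA})
with urllib.request.urlopen(req) as resp:
    page = resp.read().decode("utf-8", errors="replace")
print(page[:200])
```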
(Score: 1, Interesting) by Anonymous Coward on Tuesday July 05 2016, @08:20PM
My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent [string]
A decade ago, I hung out in the SeaMonkey newsgroup regularly.
Multiple times a week, we'd see folks reporting that SeaMonkey's (perfectly legit) UA string was getting rejected.
(Idiot web devs sniffing for "Firefox" instead of "Gecko".)
In more recent versions, the SeaMonkey devs have simply surrendered to the idiots and have SeaMonkey identify itself as Firefox by default.
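To illustrate the sniffing mistake described above (the checks are purely hypothetical, and the UA string is an approximation of an older SeaMonkey release, which carried no "Firefox" token):

```python
# Illustration only; the checks are hypothetical and the UA string is an
# approximation of an older SeaMonkey release (no "Firefox" token in it).
SEAMONKEY_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20110609 SeaMonkey/2.1"

def sniff_by_name(ua):
    # What the devs above were doing: key on the browser name.
    return "Firefox" in ua

def sniff_by_engine(ua):
    # Keying on the layout engine token accepts any Gecko-based browser.
    return "Gecko/" in ua

print(sniff_by_name(SEAMONKEY_UA))    # False -> user gets the "unsupported browser" page
print(sniff_by_engine(SEAMONKEY_UA))  # True
```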
zero reason
The phase when the "standard" was perceived to be "Internet Exploder" really screwed things up.
If the idiots were going to sniff, the **logical** thing to do would have been to make a page that passed the HTML Validator and deliver **that**.
...UNLESS the least-conforming browser (IE) was detected, whereupon a page with a bunch of hacks would have been delivered (a *different* page for each *version* of that not-even-backwards-compatible browser).
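A rough sketch of that strategy, assuming invented template names and relying on the "MSIE <version>" token that pre-Edge IE put in its user agent:

```python
# Rough sketch; template names are invented. Pre-Edge IE announced itself
# with an "MSIE <version>" token in the user agent.
import re

def pick_page(ua):
    m = re.search(r"MSIE (\d+)", ua)
    if m:
        # One workaround page per IE version, e.g. hacks_ie6.html, hacks_ie7.html
        return "hacks_ie%s.html" % m.group(1)
    # Everyone else gets the page that passes the HTML validator.
    return "validated.html"

print(pick_page("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))                    # hacks_ie6.html
print(pick_page("Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"))  # validated.html
```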
have contacted webmasters
In short, they were mostly sniffing when they didn't need to, then doing the wrong thing with the information they got.
Again, idiots who don't understand their job.
In summary, there are a great many people who are putting up websites who have no clue what they're doing and THEY DON'T CARE.
I see the way to deal with this in the same light as working around a bad boss:
Just tell them what they want to hear and get on with your life.
-- OriginalOwner_ [soylentnews.org]