Disclaimer: I work on a search engine (findx), so I try not to put competitors in a bad light.
Question: Should a web crawler always reveal its true name?
Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" page, even for requests to robots.txt. Another example is a site that had an explicit whitelist in its robots.txt. Strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. Googlebot, because it isn't clear what the clueless webmasters' intentions are. It appears that some websites are misconfigured or so Google-optimized that other/new search engines may have to resort to faking the user-agent.
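To make the whitelist scenario concrete, here is a minimal sketch using Python's urllib.robotparser. The robots.txt contents and the bot names are hypothetical, modeled on the behaviour described above, not taken from the actual site:

import urllib.robotparser

# Hypothetical whitelist-style robots.txt: only 'curl' gets a blanket allow,
# every other user-agent (wget, any new crawler) falls through to the
# catch-all entry and is blocked from the whole site.
ROBOTS_TXT = """\
User-agent: curl
Disallow:

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

for ua in ("curl", "wget", "SomeNewSearchBot"):
    print(ua, rp.can_fetch(ua, "http://example.com/some-page"))
# curl True
# wget False
# SomeNewSearchBot False

A crawler that identifies itself honestly would simply have to skip such a site; the temptation described above is to present itself as 'curl' (or Googlebot) instead.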
I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which clearly shows up when I search on Qwant) has never been crawled by a user-agent resembling anything that could be traced back to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm
This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.).
So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?
(Score: 2) by Pino P on Thursday July 07 2016, @03:14PM
If the user has disabled all interactivity, it would seem prudent to have a sane default DOM that can be rendered reasonably well.
Does a link to instructions for troubleshooting problems with execution of JavaScript count as a "sane default DOM"?
even I have heard of CSS Media Queries
CSS Media Queries control in what manner the client renders the DOM it receives. They cannot control how much data the client receives. For example, a rule conditioned by a media query can activate display: none on certain element classes when the viewport width is less than a threshold. But it can't prevent the server from sending the HTML for those elements in the first place. Many users on 5-inch screens, perhaps most in my home country (United States), pay per bit for Internet data because metered data plans are standard practice among U.S. cellular carriers. Receiving data that will never be rendered still costs money.
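A rough illustration of that point (the markup, class name, and byte counts below are made up, not taken from any real site): everything inside an element that a media query hides still travels over the connection before the CSS gets a chance to hide it.

# Toy example: a response body in which the aside would be hidden on narrow
# viewports by a rule like
#   @media (max-width: 600px) { .desktop-only { display: none; } }
# The 'desktop-only' class and the markup are hypothetical.
SAMPLE_RESPONSE = b"""\
<article><p>The two sentences the reader actually came for.</p></article>
<aside class="desktop-only">
  <!-- imagine hundreds of related-links, share buttons, trackers... -->
</aside>
"""

idx = SAMPLE_RESPONSE.find(b'<aside class="desktop-only"')
print("bytes on the wire:", len(SAMPLE_RESPONSE))
print("bytes for markup a narrow screen never renders:", len(SAMPLE_RESPONSE) - idx)

On a metered plan those hidden bytes are billed exactly like the rendered ones; display: none is a rendering decision made after the transfer has already been paid for.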
(Score: 2) by coolgopher on Friday July 08 2016, @01:48AM
For your given example, sending or not sending the first sentence would make little difference. The typical* distribution of data usage for a web page these days seems to be, from heaviest to lightest:
1. Video
2. Images
3. Scripts(!)
4. HTML structure
5. Textual content.
Browsing on mobile with images and scripts off saves a disproportionate amount of download traffic. If webmasters and web developers could get out of the noxious habit of cramming all manner of junk onto a page that carries only a teensy tiny amount of relevant information, we wouldn't even be having this discussion. Not to mention pulling in script dependencies like it's candy. See e.g. http://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos/. I *do* agree with you in principle on unrendered elements, don't get me wrong, but I think there are much bigger issues which need far more attention.
*) I'm sure I read a study on this not that long ago, but I can't seem to find the right search terms now. Links welcome.