Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.
Question: Should a web crawler always reveal its true name?
Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" page, even for robots.txt. Another example is a site that had an explicit whitelist in robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. googlebot, because it isn't clear what the clueless webmasters' intentions are. It appears that some websites are misconfigured, or so Google-optimized, that other/new search engines may have to resort to faking the user-agent.
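For concreteness, here is a minimal sketch of both situations using only the Python 3 standard library; the site URL and the crawler name "findxbot" are made-up placeholders, not real examples. It fetches robots.txt under different User-Agent headers, then checks what a user-agent-specific whitelist in that robots.txt actually permits.

```python
# Minimal sketch, Python 3 standard library only.
# The site URL and the crawler name "findxbot" are hypothetical placeholders.
import urllib.error
import urllib.request
import urllib.robotparser

SITE = "https://example.com"          # hypothetical site
ROBOTS_URL = SITE + "/robots.txt"

# 1) Some servers vary their response on the User-Agent header, even for
#    robots.txt, so the same fetch can succeed under one name and fail under another.
for agent in ("curl/7.50.0", "findxbot/1.0"):
    req = urllib.request.Request(ROBOTS_URL, headers={"User-Agent": agent})
    try:
        status = urllib.request.urlopen(req, timeout=10).status
    except urllib.error.HTTPError as err:
        status = err.code
    print(agent, "->", status)

# 2) robots.txt itself can whitelist specific agent names; which rules apply
#    depends entirely on the name the crawler announces.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()
for agent in ("curl", "wget", "findxbot"):
    print(agent, "allowed:", rp.can_fetch(agent, SITE + "/some/page"))
```

In other words, the answer a crawler gets can depend entirely on the name it announces, which is exactly what makes the temptation to fake it.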
I'm also puzzled by Qwant, because they claim to have their own search index, but my personal website (which is clearly indexed when I search in Qwant) has never been crawled by a user-agent resembling anything that could be traced back to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm
This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.).
So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?
(Score: 0) by Anonymous Coward on Tuesday July 05 2016, @08:49PM
if a copyright holder complains
NO. That's not what's being complained about.
So, on the one hand, we have a guy who hosts websites on his servers and owns all those hosted domains.
Let's call him Hostmaster.
On the other hand, we have a guy who rents one of those domains and generates all the **content** that appears on that website.
Let's call him Webmaster.
Webmaster is just fine with his content being archived.
Now, something changes (e.g. missed hosting payments) and Webmaster loses control of his domain|subdomain.
In the robots.txt for the domain formerly used by Webmaster, Hostmaster specifies that all that content is inaccessible.
archive.org GOES BACK IN TIME to the point where they already had permission to archive that content and had *done* so.
They now treat that content on archive.org's own servers as if it NEVER EXISTED.
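For illustration, the robots.txt that Hostmaster publishes can be nothing more than a blanket disallow (a hypothetical example, not any actual file), and under archive.org's robots.txt policy that also hides the snapshots they had already taken:

```
User-agent: *
Disallow: /
```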
In a slight plot twist, if Webmaster had actually *owned* his domain and let his registration lapse, and Hostmaster subsequently snatched up that URL, we'd be at the same point.
This stuff is a problem of Capitalism|ownership|rent-seeking.
The logical argument to be made (which you are missing) is that the Intellectual Property still belongs to the (former) Webmaster, who is still its creator.
...as well as the fact that Hostmaster is being a dick and that archive.org is siding with the dick.
-- OriginalOwner_ [soylentnews.org]
(Score: 2) by Scruffy Beard 2 on Wednesday July 06 2016, @05:03PM
You missed some nuance in my post. Maybe I was not clear.
I said they treat it "like" a take-down request.
I am aware that the domain holder may not be the copyright holder in many (most?) cases.