
posted by martyb on Tuesday July 05 2016, @10:51AM
from the what-would-you-do? dept.

Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.

Question: Should a web crawler always reveal its true name?

Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" page, even for robots.txt. Another example is a site that had an explicit whitelist in its robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent (e.g., googlebot) because it isn't clear what these clueless webmasters intended. It appears that some websites are so misconfigured or so Google-optimized that other or newer search engines may have to resort to faking the user-agent.
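A rough sketch of how such a whitelist plays out, using Python's standard urllib.robotparser; the robots.txt content here is hypothetical, made up to mirror the 'curl but not wget' case above:

    # Check which user-agents a whitelist-style robots.txt permits.
    from urllib import robotparser

    # Hypothetical robots.txt: 'curl' may fetch everything, everyone else nothing.
    ROBOTS_TXT = """\
    User-agent: curl
    Disallow:

    User-agent: *
    Disallow: /
    """

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    for agent in ("curl", "wget", "Googlebot"):
        print(agent, "->", rp.can_fetch(agent, "https://example.com/page.html"))
    # Prints True for curl and False for the rest: a new crawler's honest
    # user-agent is locked out before it fetches a single page.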

I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which clearly turns up when I search on Qwant) has never been crawled by a user-agent resembling anything that could lead back to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm

This is different from search engines that don't have their own index (e.g., DuckDuckGo uses results from Yahoo! and Yandex; Startpage uses Google; etc.)

So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?


Original Submission

 
  • (Score: 3, Insightful) by ledow (5567) on Tuesday July 05 2016, @01:11PM (#370035)

    It's untrusted data.

    Any program running on a server that relies on that data being accurate is basically in error.

    However, unless you have a REASON to fake it, I don't see why you should. If you can't "see" something with, say, a Chrome UA, why should you bother to index it etc. at all? Just let it die out on its own while it only allows curl or whatever.

    That said, robots.txt is a nonsense too. You really think that it's going to stop your stuff being found, especially if you end up having to list "what not to look at".

    As far as I'm concerned a fake UA isn't a problem. But if we start going down the road of "every browser has a uniquely random UA", we'd be more honest and accurate ("Don't try and guess my browser, just give me the HTML page I asked for"), but at the same time browser usage stats for websites would become useless. I can't say I'd miss them, but we'd never again be able to tell whether Edge is beating out Chrome or whatever. Not that anything along those lines is even vaguely accurate anyway, precisely because the UA can just be made up or cloned.
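    To make that concrete: the UA is just another request header the client fills in however it likes. A minimal Python sketch (the UA string and URL are made up for illustration):

        # The User-Agent is an arbitrary, client-chosen header; a server
        # that trusts it is trusting whatever the client felt like sending.
        import urllib.request

        req = urllib.request.Request(
            "https://example.com/",
            headers={"User-Agent": "DefinitelyChrome/999.0"},  # made-up UA
        )
        with urllib.request.urlopen(req) as resp:
            print(resp.status, len(resp.read()), "bytes served to a made-up UA")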

  • (Score: 4, Interesting) by Bill Dimm (940) on Tuesday July 05 2016, @02:00PM (#370069)

    > That said, robots.txt is a nonsense too. You really think that it's going to stop your stuff being found, especially if you end up having to list "what not to look at".

    No, I don't think it will stop the content from being found, but it will tell a well-behaved bot that it would be stupid to try to index it. If a bot insists on trying anyway, what will stop it is when I ban its IP address from accessing the site entirely. There are plenty of legitimate reasons to indicate that some pages shouldn't be indexed: the content may be temporary, or it may be highly redundant (e.g., a list of links to similar articles). If your bot plows through it anyway, even though the site owner, who understands the content far better than your bot does, went to the trouble of spelling out what is useful to index and what isn't, you are just demonstrating that your bot isn't worth tolerating at all.
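    For what it's worth, honoring that is only a few lines in a crawler's fetch loop; a minimal Python sketch using urllib.robotparser (the bot name and URLs are hypothetical):

        # Consult robots.txt under the crawler's true name and skip
        # anything the site owner has marked off-limits.
        import urllib.request
        from urllib import robotparser

        AGENT = "examplebot/1.0"  # hypothetical crawler name
        rp = robotparser.RobotFileParser("https://example.com/robots.txt")
        rp.read()  # fetch and parse the live robots.txt

        for url in ("https://example.com/articles/1",
                    "https://example.com/tmp/session-page"):
            if rp.can_fetch(AGENT, url):
                req = urllib.request.Request(url, headers={"User-Agent": AGENT})
                with urllib.request.urlopen(req) as resp:
                    resp.read()  # hand the page to the indexer
            else:
                print("skipping", url, "(disallowed for", AGENT + ")")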

  • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @08:43PM (#370246)

    > why should you bother?

    Oh, man. You've lost your Indiana Jones mojo.

    > Just let it die out

    What if it's the last place on the planet containing the answer to life, the universe, and everything?

    > Don't try and guess my browser, just give me the HTML page I asked for

    Now, based on that platform, anytime you choose to run, you've got my vote for President of the Internet.

    -- OriginalOwner_ [soylentnews.org]