
posted by martyb on Tuesday July 05 2016, @10:51AM   Printer-friendly
from the what-would-you-do? dept.

Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.

Question: Should a web crawler always reveal its true name?

Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns "your browser is not supported", even for robots.txt. Another example is a site that had an explicit whitelist in robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. googlebot, because it isn't clear what the clueless webmasters' intentions are. It appears that some websites are so misconfigured or so Google-optimized that other/new search engines may have to resort to faking the user-agent.
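
To make the robots.txt part concrete, here is a minimal sketch (Python; the 'findxbot' user-agent string and example.com are placeholders, not real values) of how a crawler can fetch robots.txt under its own name and then test whether a URL is allowed for that agent:

    # Fetch robots.txt while identifying honestly, then check permissions.
    # Assumes the third-party 'requests' package is installed; the
    # user-agent and site below are placeholders.
    import requests
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "findxbot"          # hypothetical crawler name
    SITE = "https://example.com"     # placeholder site

    resp = requests.get(SITE + "/robots.txt",
                        headers={"User-Agent": USER_AGENT},
                        timeout=10)

    parser = RobotFileParser()
    if resp.status_code == 200:
        parser.parse(resp.text.splitlines())
    else:
        # Sites that reject unknown agents often return an error even here;
        # a missing robots.txt is conventionally treated as "allow all".
        parser.parse([])

    print(parser.can_fetch(USER_AGENT, SITE + "/some/page.html"))

A site that rejects even the robots.txt request for an unfamiliar user-agent breaks this flow before the crawler ever gets to the allow/deny question, which is exactly the situation described above.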

I'm also puzzled by Qwant, because they claim to have their own search index, but my personal website (which is clearly indexed when I search on Qwant) has never been crawled by a user-agent resembling anything that could lead to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm

This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.).

So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?


Original Submission

 
  • (Score: 5, Informative) by GungnirSniper on Tuesday July 05 2016, @01:57PM

    by GungnirSniper (1671) on Tuesday July 05 2016, @01:57PM (#370064) Journal

    The worst abuse of robots.txt isn't by generic crawlers, but the fact that the Internet Archive retroactively applies domain-wide disallows to the entire history of a domain. This means sites that were once available there can disappear if the domain name registration expires and is picked up by a speculator or squatter who uses a stricter robots.txt. I've seen technical documentation disappear this way and it is disheartening.

  • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @02:06PM

    by Anonymous Coward on Tuesday July 05 2016, @02:06PM (#370074)

    I thought it was supposed to be an archive? Not much of an archive if they just delete the history of an entire site based on some arbitrary nonsense.

  • (Score: 3, Insightful) by SomeGuy on Tuesday July 05 2016, @02:18PM

    by SomeGuy (5632) on Tuesday July 05 2016, @02:18PM (#370077)

    So they still haven't fixed this?

    From an archival standpoint, there is little difference between a changed robots.txt and a page that has simply been removed.

    Why make one retroactive but not the other?

    If there is content that the site author did not want archived... well they should not have put it on the web in the first place but besides that... they should directly request removal in the same way.

    • (Score: 2) by Scruffy Beard 2 on Tuesday July 05 2016, @03:46PM

      by Scruffy Beard 2 (6030) on Tuesday July 05 2016, @03:46PM (#370119)

      My understanding is that they treat it like a take-down request.

      They have a model of putting information out there and taking it down if a copyright holder complains.

      • (Score: 2) by GungnirSniper on Tuesday July 05 2016, @04:38PM

        by GungnirSniper (1671) on Tuesday July 05 2016, @04:38PM (#370146) Journal

        Surely there is a better way than an eternal takedown, because eventually nearly every domain is going to change hands. Archived sites that had great info years ago shouldn't just disappear forever because Sedo or GoDaddy or someone icky gets the registration. It would be like a reused ISBN cancelling the copyright on the previously published work.

        • (Score: 1, Informative) by Anonymous Coward on Tuesday July 05 2016, @05:53PM

          by Anonymous Coward on Tuesday July 05 2016, @05:53PM (#370172)

          Their take down isn't permanent. Every bit of data they collect is still there, and some sites' archives are still downloadable to the public as WARC files. Their policy is to take down access through the Wayback Machine until the robots.txt disappears or the copyright expires.

          • (Score: 1) by Chrontius on Wednesday July 06 2016, @11:47PM

            by Chrontius (5246) on Wednesday July 06 2016, @11:47PM (#371033)

            I've tried working with WARC files, but I'm still hazy on how to browse them. Do you have any guides you could point me to?
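
            A rough sketch of walking the records programmatically, using the Python warcio package (assuming it is installed; the file name is a placeholder):

                # Iterate over a downloaded WARC file and list each archived response.
                # Requires the 'warcio' package; 'example.warc.gz' is a placeholder path.
                from warcio.archiveiterator import ArchiveIterator

                with open('example.warc.gz', 'rb') as stream:
                    for record in ArchiveIterator(stream):
                        if record.rec_type == 'response':
                            uri = record.rec_headers.get_header('WARC-Target-URI')
                            ctype = record.http_headers.get_header('Content-Type')
                            print(uri, ctype)
                            # record.content_stream().read() returns the raw body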

      • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @08:49PM

        by Anonymous Coward on Tuesday July 05 2016, @08:49PM (#370248)

        if a copyright holder complains

        NO. That's not what's being complained about.

        So, on the one hand, we have a guy who hosts websites on his servers and owns all those hosted domains.
        Let's call him Hostmaster.

        On the other hand, we have a guy who rents one of those domains and generates all the **content** that appears on that website.
        Let's call him Webmaster.
        Webmaster is just fine with his content being archived.

        Now, something changes (e.g. missed hosting payments) and Webmaster loses control of his domain|subdomain.

        In the robots.txt for the domain formerly used by Webmaster, Hostmaster specifies that all that content is inaccessible.

        archive.org GOES BACK IN TIME to the point where they already had permission to archive that content and had *done* so.
        They now treat that content on archive.org's own servers as if it NEVER EXISTED.

        .
        In a slight plot twist, if Webmaster actually *owned* his domain and let his registration lapse, and Hostmaster subsequently snatched up that URL, we'd be at the same point.

        This stuff is a problem of Capitalism|ownership|rent-seeking.
        The logical argument to be made (which you are missing) is that the Intellectual Property still belongs to the (former) Webmaster who is still its creator.

        ...as well as the fact that Hostmaster is being a dick and that archive.org is siding with the dick.

        -- OriginalOwner_ [soylentnews.org]

        • (Score: 2) by Scruffy Beard 2 on Wednesday July 06 2016, @05:03PM

          by Scruffy Beard 2 (6030) on Wednesday July 06 2016, @05:03PM (#370770)

          You missed some nuance in my post. Maybe I was not clear.

          I said they treat it "like" a take-down request.

          I am aware that the domain holder may not be the copyright holder in many (most?) cases.

  • (Score: 3, Interesting) by bradley13 on Tuesday July 05 2016, @03:53PM

    by bradley13 (3053) Subscriber Badge on Tuesday July 05 2016, @03:53PM (#370126) Homepage Journal

    Mixed feelings about Archive.org. I once was very happy to have this behavior.

    Getty Images once threatened to sue our micro-company over images we had on a website. We had purchased these from a smaller site that Getty later bought; after the acquisition, Getty apparently discarded the sales records. They were unimpressed by our physical receipts, because these didn't map directly to the image numbers.

    We could fight them in court, or we could pay them off for only $x thousand, a figure they set to be less than initial legal costs would have been. It's sort of like the ransomware out there, only they abuse the legal system instead of cryptography. Their timing was also great, just before Christmas, when they bloody well knew that most people didn't want to deal with crap.

    Anyhow, back to Archive.org: I was glad at the time to be able to take down not only the images, but also all copies at Archive.org, just to prevent any potential repeat of the idiocy. At the same time, this is a shame, as it means that Archive.org fails to be a true archive. It ought to show what was available at any particular point in time, regardless of later changes.

    Search for "getty images extortion" - they apparently play this game a lot. They also continue to buy other, smaller image sites - it's increasingly difficult to avoid them.

    --
    Everyone is somebody else's weirdo.