
posted by martyb on Tuesday July 05 2016, @10:51AM
from the what-would-you-do? dept.

Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.

Question: Should a web crawler always reveal its true name?

Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" page, even for robots.txt. Another example is a site that had an explicit whitelist in robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent (e.g. Googlebot) because it isn't clear what the clueless webmasters intended. It appears that some websites are so misconfigured or so Google-optimized that other/new search engines may have to resort to faking the user-agent.
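
To make the robots.txt part concrete, here is roughly the kind of check involved, as a minimal sketch in Python (requests and urllib.robotparser assumed; the site, page, and agent names are placeholders):

    import requests
    from urllib.robotparser import RobotFileParser

    SITE = "https://example.com"          # placeholder site
    PAGE = SITE + "/some/page.html"       # placeholder page to test

    def allowed_for(agent):
        """Fetch robots.txt under an explicit User-Agent, then ask whether `agent` may fetch PAGE."""
        resp = requests.get(SITE + "/robots.txt", headers={"User-Agent": agent}, timeout=10)
        rp = RobotFileParser()
        rp.parse(resp.text.splitlines())
        return rp.can_fetch(agent, PAGE)

    for ua in ("findxbot", "curl", "wget", "Googlebot"):
        print(ua, "allowed" if allowed_for(ua) else "disallowed")

A site that answers "your browser is not supported" to the robots.txt request itself breaks even this first step.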

I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which is clearly indexed when I search on Qwant) has never been crawled by a user-agent resembling anything that could lead back to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm

This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.)

So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?


Original Submission

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Informative) by coolgopher on Tuesday July 05 2016, @11:01AM

    by coolgopher (1157) Subscriber Badge on Tuesday July 05 2016, @11:01AM (#370008)

    Sites relying on the User-Agent header need to DIAF. If your site needs specific feature support only available in some user agents, test for the feature, not the user agent. The user-agent header was always a bad idea, and it became absolutely terrible thanks to IE's lack of standards compliance.

    Unless you have a prior agreement with a service provider that requires you to be accurate with the user-agent, you really don't need to be. Heck, you could simply omit the User-Agent altogether, as it's not a required header in RFC 7231.
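
    For what it's worth, a quick sketch of doing exactly that (Python with the requests library assumed; requests drops headers whose value is None when merging, so no User-Agent is sent at all):

        import requests

        # Ask for a page with no User-Agent header whatsoever.
        resp = requests.get("https://example.com/",          # placeholder URL
                            headers={"User-Agent": None},    # None => header is dropped entirely
                            timeout=10)
        print(resp.status_code, resp.headers.get("Server"))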

    • (Score: 5, Funny) by cockroach on Tuesday July 05 2016, @11:47AM

      by cockroach (2266) on Tuesday July 05 2016, @11:47AM (#370018)

      For a fun time try setting your user agent to an empty string -- there's a surprising number of pages that will refuse to load or end up in redirection loops.
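
      Something like this will show it (a sketch, Python requests assumed; requests gives up after its default limit of 30 redirects):

          import requests

          try:
              resp = requests.get("https://example.com/",        # placeholder URL
                                  headers={"User-Agent": ""},    # empty user agent string
                                  timeout=10)
              print("loaded:", resp.status_code)
          except requests.exceptions.TooManyRedirects:
              print("redirect loop with an empty User-Agent")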

      • (Score: 1, Informative) by Anonymous Coward on Wednesday July 06 2016, @07:38AM

        by Anonymous Coward on Wednesday July 06 2016, @07:38AM (#370528)

        There's an Apache module that does this - I think it's the one named "mod_security".

        Our monitoring system at work does not send the user-agent header, so we've had to turn the stupid module off a couple of times on the one Apache web server some moron introduced[1]. Apparently the hosting company turns it back on once in a while.

        Oh sure, we could fix the monitoring system. But why, when the alternative is that the moron who absolutely had to have an Apache server is the one suffering for his choice?

        [1] We are five developers, and we already span .NET, iOS, and Android; we don't need any extra platforms.

    • (Score: 3, Insightful) by SomeGuy on Tuesday July 05 2016, @01:59PM

      by SomeGuy (5632) on Tuesday July 05 2016, @01:59PM (#370068)

      Sites relying on the User-Agent header need to DIAF.

      This. My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent. When I have contacted web masters about this, all I get is some garbage like "der looks like malware herp derp, use chrome/firefox because security derrrr." Which I find completely incredible, because there is zero reason why a malicious web client would not spoof the more popular user agents.

      Also, thanks to all the dependence on scripting these days, it seems like the only way to index a page is to load it in each of the big three web browsers (Firefox, Chrome, IE; I still miss Opera) with a "standard" configuration (no ad blockers, scripting enabled, anus wide open ready for insertion) and then OCR it. Might as well go back to using Flash.

      • (Score: 1, Insightful) by Anonymous Coward on Tuesday July 05 2016, @03:45PM

        by Anonymous Coward on Tuesday July 05 2016, @03:45PM (#370117)

        This. My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent.

        Ah, yes. I use Pale Moon with the Random Agent Spoofer add-on (it's for FF, but also works just fine in PM). The spoofer randomizes my user agent whenever I restart the browser or click on the icon, to one of, I dunno, probably several hundred different OS/browser combinations, half of which I've never heard of (what the hell is Arora? Omniweb? Uzbl?).

        I'll occasionally get warnings about unsupported/out-of-date browser or strange layout problems (Google Maps and Image Search are terrible at this), but manually setting the user agent to some recent version of FF or Chrome invariably fixes the problem.

        What I'm trying to say is, if sites give you crap about your user agent, just fake it and pretend to be a recent version of Firefox or something.

        Bonus: it also helps with tracking, if you don't like websites following your every move. With an "oddball web browser" you're quite easy prey; it's better to either change it periodically or settle on something a lot of people use...

      • (Score: 1, Interesting) by Anonymous Coward on Tuesday July 05 2016, @08:20PM

        by Anonymous Coward on Tuesday July 05 2016, @08:20PM (#370235)

        My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent [string]

        A decade ago, I hung out in the SeaMonkey newsgroup regularly.
        Multiple times a week, we'd see folks reporting that SeaMonkey's (perfectly legit) UA string was getting rejected.
        (Idiot web devs sniffing for "Firefox" instead of "Gecko".)

        In more recent versions, the SeaMonkey devs have simply surrendered to the idiots and have identified SeaMonkey as Firefox by default.

        zero reason

        The phase when the "standard" was perceived to be "Internet Exploder" got things really screwed up.
        If the idiots were going to sniff, the **logical** thing to do would have been to make a page that passed the HTML Validator and deliver **that**.
        ...UNLESS the least-conforming browser (IE) was detected, whereupon a page with a bunch of hacks would have been delivered (a *different* page for each *version* of that not-even-backwards-compatible browser).

        have contacted web masters

        In short, they were mostly sniffing when they didn't need to, then doing the wrong thing with the information they got.
        Again, idiots who don't understand their job.

        In summary, there are a great many people who are putting up websites who have no clue what they're doing and THEY DON'T CARE.
        I see the way to deal with this in the same light as working around a bad boss:
        Just tell them what they want to hear and get on with your life.

        -- OriginalOwner_ [soylentnews.org]

    • (Score: 2) by darkfeline on Tuesday July 05 2016, @03:35PM

      by darkfeline (1030) on Tuesday July 05 2016, @03:35PM (#370112) Homepage

      >Heck, you could simply omit the user-agent as it's not a required header in RFC7231.

      You COULD, but you shouldn't.

      The point of the user agent is to identify your client so bugs can be tracked down to the right source. For example, if I notice a huge torrent of traffic with the user agent of your library, version 1.2, I can tell you about it; you can then go and look and discover a bug where, if settings A and B are both set, it goes into an infinite loop.

      The point is NOT to vary the content based on the user agent's supported feature set. HTTP already has a method for content negotiation: the Accept header.

      Nowadays the user agent header is fucking useless since everyone just sends the same few variations of Mozilla/5.0.
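
      To illustrate the distinction (a sketch only, Python requests assumed; the library name and contact URL are made up): identify the client in the User-Agent so problems can be traced back to you, and do content negotiation with Accept instead.

          import requests

          headers = {
              # who is making the request, which version, and where to report problems
              "User-Agent": "examplelib/1.2 (+https://example.com/about-this-client)",
              # content negotiation belongs here, not in the UA string
              "Accept": "text/html, application/xhtml+xml;q=0.9, */*;q=0.1",
          }
          resp = requests.get("https://example.com/page", headers=headers, timeout=10)
          print(resp.headers.get("Content-Type"))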

      --
      Join the SDF Public Access UNIX System today!
    • (Score: 2) by Justin Case on Wednesday July 06 2016, @01:33AM

      by Justin Case (4239) on Wednesday July 06 2016, @01:33AM (#370401) Journal

      Sites relying on the User-Agent header need to DIAF.

      +1. If your site cares what device or software I'm using, you're doing it wrong and need to be banished to the fiery depths.

      robots.txt is equally silly, meaningless, and there only to waste everybody's time with a "standard" that is a joke.

      Spider it first with a common User-agent, or even be honest that you're a spider. Obey robots.txt.

      Once you have everything you can find that way, hit it again with some lies. Use some throwaway IPs if you can.

      If anything changes you've found an idiot. Congratulations. They need to feel some pain. Hit them with everything you've got. Do it for England and the Queen.

    • (Score: 2) by Pino P on Thursday July 07 2016, @12:52AM

      by Pino P (4721) on Thursday July 07 2016, @12:52AM (#371069) Journal

      How do you "test for the feature" if the user is refusing to run any scripts (e.g. JS turned off, JS files and elements blocked at proxy, or NoScript or LibreJS extensions)? How do you test for screen size if you want to send a list of article titles to user agents on 4 to 5 inch screens but titles and the first sentence to user agents on tablets or desktops?

      • (Score: 2) by coolgopher on Thursday July 07 2016, @07:20AM

        by coolgopher (1157) Subscriber Badge on Thursday July 07 2016, @07:20AM (#371167)

        A1: If the user has disabled all interactivity, it would seem prudent to have a sane default DOM that can be rendered reasonably well. A blank page doesn't qualify as sane, unless its name is "white.html", btw.

        A2: I'm not a web developer, but even I have heard of CSS Media Queries; heck I've even used them on occasion. Presumably there are even more tools available. Try looking around the accessibility stuff, that's usually quite enlightening.

        • (Score: 2) by Pino P on Thursday July 07 2016, @03:14PM

          by Pino P (4721) on Thursday July 07 2016, @03:14PM (#371279) Journal

          If the user has disabled all interactivity, it would seem prudent to have a sane default DOM that can be rendered reasonably well.

          Does a link to instructions for troubleshooting problems with execution of JavaScript count as a "sane default DOM"?

          even I have heard of CSS Media Queries

          CSS Media Queries control in what manner the client renders the DOM it receives. They cannot control how much data the client receives. For example, a rule conditioned by a media query can activate display: none on certain element classes when the viewport width is less than a threshold. But it can't prevent the server from sending the HTML for those elements in the first place. Many users on 5-inch screens, perhaps most in my home country (United States), pay per bit for Internet data because metered data plans are standard practice among U.S. cellular carriers. Receiving data that will never be rendered still costs money.

          • (Score: 2) by coolgopher on Friday July 08 2016, @01:48AM

            by coolgopher (1157) Subscriber Badge on Friday July 08 2016, @01:48AM (#371583)

            For your given example, sending or not sending the first sentence would make little difference. The typical* distribution of data usage for a web page these days seems to be, from heaviest to lightest:
            1. Video
            2. Images
            3. Scripts(!)
            4. HTML structure
            5. Textual content.

            Browsing on mobile with images and scripts off saves a disproportionate amount of downloads. If webmasters and web developers could get out of the noxious habit of cramming all manner of junk onto pages that contain only a teensy tiny amount of relevant information, we wouldn't even be having this discussion. Not to mention pulling in script dependencies like candy. See e.g. http://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos/ [theregister.co.uk]. I *do* agree with you in principle on unrendered elements, don't get me wrong, but I think there are much bigger issues which need far more attention.

            *) I'm sure I read a study on this not that long ago, but I can't seem to find the right search terms now. Links welcome.

  • (Score: 5, Interesting) by Unixnut on Tuesday July 05 2016, @11:10AM

    by Unixnut (5779) on Tuesday July 05 2016, @11:10AM (#370010)

    I am also working on a search engine (although it is a reverse image search engine, and still in the research phase, so not publicly available), and when spidering I use the Googlebot user agent. This is because quite a few sites and applications have become so Google-centric that they don't know (or care) about alternatives. As a result they will let through human user agents and Googlebot, but assume the rest are scrapers/bots and reject them.

    Even worse, some of them use intrusion detection systems which, if they don't see your bot in their whitelist, will assume you are nefarious and trying to scrape the site, and block your IP for some time. As I only have one static IP, and quite a few sites are hosted by the same companies behind the same IDS, this can quickly result in me being denied access to a lot of places.

    So for the time being I spoof Googlebot and follow the same rules it does. A bit like how, in the dark days of Microsoft, people would spoof IE user-agents to be able to view websites. At least until you become well known enough that people know your bot is OK to whitelist (and perhaps, one day, will even encourage its arrival).

    • (Score: 5, Insightful) by datapharmer on Tuesday July 05 2016, @11:29AM

      by datapharmer (2702) on Tuesday July 05 2016, @11:29AM (#370013)

      And some of us do a reverse lookup on all bot traffic: if a "Googlebot" doesn't resolve back to a Google IP, it gets redirected to a honeypot for bad bots. No offense, but this is bad advice - you are missing out on more of the web than you might realize by doing this.
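
      The check is roughly this (a Python sketch; the address below is just an example): reverse-resolve the client IP, require a googlebot.com or google.com hostname, then forward-resolve that hostname and make sure it maps back to the same IP.

          import socket

          def is_real_googlebot(ip):
              try:
                  host = socket.gethostbyaddr(ip)[0]            # reverse lookup
              except socket.herror:
                  return False
              if not host.endswith((".googlebot.com", ".google.com")):
                  return False
              try:                                              # forward-confirm the hostname
                  return ip in socket.gethostbyname_ex(host)[2]
              except socket.gaierror:
                  return False

          print(is_real_googlebot("66.249.66.1"))               # example address only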

      • (Score: 5, Insightful) by GungnirSniper on Tuesday July 05 2016, @01:01PM

        by GungnirSniper (1671) on Tuesday July 05 2016, @01:01PM (#370031) Journal

        So what does he do when using a blank User Agent gets endless loops, and non-standard UAs get banned or blocked?

        Aren't you missing out on potential readers when redirecting traffic to a honeypot and reinforcing Google's absolute dominance in the search space?

        • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @01:14PM

          by Anonymous Coward on Tuesday July 05 2016, @01:14PM (#370038)

          Ignore the misconfigured website. It is not important to your insignificant crawler.
          And the other way around: your insignificant spam crawler that fakes its UA is not important to my insignificant website either.

        • (Score: 1, Interesting) by Anonymous Coward on Tuesday July 05 2016, @01:32PM

          by Anonymous Coward on Tuesday July 05 2016, @01:32PM (#370048)

          > So what does he do when using a blank User Agent gets endless loops, and non-standard UAs get banned or blocked?

          Crawl the site multiple times with different UAs. Don't do it back-to-back, or at least not from the same IP address.

        • (Score: 2) by datapharmer on Thursday July 07 2016, @02:54PM

          by datapharmer (2702) on Thursday July 07 2016, @02:54PM (#371270)

          There is no penalty for not sharing, only for lying.

      • (Score: 2) by Unixnut on Tuesday July 05 2016, @01:28PM

        by Unixnut (5779) on Tuesday July 05 2016, @01:28PM (#370047)

        Thankfully I have not come across this yet. I think only once did the bot end up stuck in a bit of a loop, but that was more my poor coding (not checking for infinite recursion) than anything else; a bug I corrected.

        If a site really doesn't want to be spidered that badly, to the point where they redirect to honeypots, then I won't push it further. Using the Googlebot user agent still works better than a non-Googlebot user agent. So for example, if my hit rate goes from 60% to 95% (minus the few who redirect to honeypots, because face it, few people are quite that paranoid), that is still an improvement.

        On a related note, is there some sort of open source bot that I can use/modify, or does everyone just write their own in some awful "let's reinvent the wheel a few hundred times" way? I know of some example code out there (indeed, that's what I started mine from), but I can't find an actual project with active development.

        • (Score: 3, Insightful) by TheRaven on Tuesday July 05 2016, @01:32PM

          by TheRaven (270) on Tuesday July 05 2016, @01:32PM (#370049) Journal
          Try with a few different user agents from a few different IPs. Penalise sites that serve different content to them.
          --
          sudo mod me up
          • (Score: 2) by Unixnut on Tuesday July 05 2016, @02:22PM

            by Unixnut (5779) on Tuesday July 05 2016, @02:22PM (#370079)

            That would work if I had multiple IPs (I don't), although based on the comments so far here, I may well alter it to randomly select a web browser user agent, and hope that no IDSes out there use pattern matching to see if my "browser" is actually behaving more like a bot than a browser.

            Really quite a bit of faff for what used to be quite simple: crawling web pages.

            • (Score: 1, Insightful) by Anonymous Coward on Tuesday July 05 2016, @03:50PM

              by Anonymous Coward on Tuesday July 05 2016, @03:50PM (#370122)

              You could try getting a VPN, they're not that expensive and you can usually choose between a few dozen different servers. Gives you a lot of different IP addresses to use.

              • (Score: 1) by toddestan on Thursday July 07 2016, @02:51AM

                by toddestan (4982) on Thursday July 07 2016, @02:51AM (#371102)

                If they think you're on a VPN, a bunch of sites will redirect you to Google's obnoxious reCAPTCHA because they assume you're a bot or up to some other nefarious purpose. You really can't win.

        • (Score: 1) by isj on Tuesday July 05 2016, @05:56PM

          by isj (5249) on Tuesday July 05 2016, @05:56PM (#370174) Homepage

          On a related note, is there some sort of open source bot that I can use/modify, or does everyone just write their own bot, in some awful "lets reinvent the wheel a few hundred times" way? I know some example code out there (indeed what I used initially for mine), but no actual proper project with active development that I can find.

          We're using a fork of https://github.com/gigablast/open-source-search-engine/ [github.com]. The code (C-style C++) is complex and large, but it offers some features that are hard to find in other projects.
          Crawling: I'm not aware of any projects specializing in that, but there must be some simple ones out there based on curl/wget and a bolted-on scheduler.
          Indexing and searching: If what you intend to index will be relatively uniform and comparable (and spam-free) and word order does not matter to you, then any of the engines supporting BM25 are faster and simpler.

  • (Score: 4, Interesting) by datapharmer on Tuesday July 05 2016, @11:27AM

    by datapharmer (2702) on Tuesday July 05 2016, @11:27AM (#370011)

    As someone who has banned search bots for being disrespectful I suggest the following:

    -Always, ALWAYS follow robots.txt (I honeypot the oh-so-interesting admin URL listed in my Disallow lines, so requesting it is your fastest way to get banned forever)
    -List what user agents and IP addresses I can expect on your own website, and make sure this can be found through OTHER major search engines
    -Apply for whitelisting of your IP ranges through all the services you can find and leave notes at places like project honeypot that you are a search engine and how to find out more information
    -A browser UA is fine, but don't use another bot's UA.
    -Follow crawl rates, and if the site starts to get laggy, back off and come back later!

    The reason for these "rules" is that if I see a great deal of automated traffic from an unknown crawler, my first instinct is that it is malicious, because more often than not it is. Also, if I see something claiming to be Bing, Google, Yandex, etc. that doesn't reverse-resolve to their IP ranges, I assume it is trying to fool me for some reason that more often than not is not out there to help me. My website is ultimately for humans, so bots of any kind are treated as second-class citizens and eyed with suspicion.

    All that said, use a browser UA if you need to get the page to render. Heck, use a couple in case the content is different depending on the UA - one might be easier for your bot to index.
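
    For crawler authors, a hedged sketch of the behaviour those rules ask for (Python with requests and urllib.robotparser assumed; the bot name, URLs, and thresholds are purely illustrative):

        import time
        import requests
        from urllib.robotparser import RobotFileParser

        BOT_UA = "Mozilla/5.0 (compatible; Examplebot/1.0; +http://example.com/bot)"
        SITE = "https://example.com"

        rp = RobotFileParser(SITE + "/robots.txt")
        rp.read()                                        # always fetch and obey robots.txt
        delay = rp.crawl_delay("Examplebot") or 5        # honour Crawl-delay, otherwise be gentle

        for path in ("/", "/about", "/articles"):        # toy list of pages to fetch
            if not rp.can_fetch("Examplebot", SITE + path):
                continue                                 # never touch disallowed paths
            start = time.time()
            requests.get(SITE + path, headers={"User-Agent": BOT_UA}, timeout=30)
            if time.time() - start > 2:                  # site is getting laggy: back off
                delay *= 2
            time.sleep(delay)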

    • (Score: 1) by isj on Tuesday July 05 2016, @04:20PM

      by isj (5249) on Tuesday July 05 2016, @04:20PM (#370141) Homepage

      I honeypot the oh-so-interesting admin URL listed in my Disallow lines, so requesting it is your fastest way to get banned forever

      I have done the same thing since 2003 but have never seen any requests for that non-existing URL in robots.txt. Is your honeypot on a well-known path, e.g. /wpadmin or similar?

      • (Score: 2) by datapharmer on Thursday July 07 2016, @02:52PM

        by datapharmer (2702) on Thursday July 07 2016, @02:52PM (#371267)

        I've got both: one in robots.txt and some common URLs (phpmyadmin etc). The common admin URLs get hit pretty regularly; the robots.txt disallow I've only seen triggered a handful of times across maybe 2 or 3 of my sites.

  • (Score: 4, Informative) by Anonymous Coward on Tuesday July 05 2016, @11:31AM

    by Anonymous Coward on Tuesday July 05 2016, @11:31AM (#370014)

    It will get your IP banned on a whole range of sites that match the bot UA against known Google IPs and ban everything that doesn't fit.

  • (Score: 1, Insightful) by stormreaver on Tuesday July 05 2016, @12:28PM

    by stormreaver (5101) on Tuesday July 05 2016, @12:28PM (#370021)

    Using a fake user agent just encourages bad website configuration. If your search engine becomes popular enough to matter, then not indexing badly configured sites will encourage those sites' administrators to correct their errors. Otherwise, your site doesn't matter anyway.

    If I want my site to be available only to Google, and you're pretending to be Google, then you're committing multiple felonies in violation of the Computer Fraud and Abuse Act. You know, the one that tends to be so heavily abused that it caused at least one prominent developer to kill himself rather than face its consequences. I may not matter enough for the TLAs to chase you, but you will eventually piss off someone who does.

    And finally: Don't be a dick. Use your software's actual user agent. If you're rejected by a Web site (for whatever reason), just move on.

    • (Score: 5, Informative) by coolgopher on Tuesday July 05 2016, @03:55PM

      by coolgopher (1157) Subscriber Badge on Tuesday July 05 2016, @03:55PM (#370127)

      Sorry, but those claims are absolute bulldust. A user-agent header does not constitute "pretending to be Google" and sure as fuck doesn't violate laws. If you actually Read The F(ine|ucking) RFC (§5.5.3) you'll see that user-agent masquerading is explicitly mentioned and accepted if a client wishes to receive "responses tailored for the identified user agent".

      • (Score: 2) by stormreaver on Wednesday July 06 2016, @12:41AM

        by stormreaver (5101) on Wednesday July 06 2016, @12:41AM (#370370)

        Your RFC argument is weak, at best (though not entirely outrageous). That being said, technical definitions are entirely irrelevant to legal proceedings. You clearly haven't seen the successful abuses perpetrated under the CFAA. Any such prosecution under the CFAA will likely include user-agent spoofing as falsifying your identity to a large enough corporation, which is entirely illegal under the law.

        Also, read the article. The author is spoofing the user agent to get around explicit blockades put into place by the Web site owner (even if those blockades weren't explicitly meant for the author). Again, illegal hacking under the CFAA.

        • (Score: 1) by isj on Wednesday July 06 2016, @12:58PM

          by isj (5249) on Wednesday July 06 2016, @12:58PM (#370610) Homepage

          Also, read the article. The author is spoofing the user agent

          I'd like to make it very clear that our crawler doesn't spoof the user-agent string.

          I was curious if we are being naive by not spoofing it.

          If by "article" you refer to the links I provided, then yes, there are indications that some crawlers are doing something fishy, or that their search index is actually provided by a third party.

        • (Score: 1) by toddestan on Thursday July 07 2016, @02:57AM

          by toddestan (4982) on Thursday July 07 2016, @02:57AM (#371105)

          Keep in mind that every major browser has been spoofing its user agent, pretending to be Netscape, for YEARS. The website owners may not like it and could ban you (which would be well within their rights), but I wouldn't worry too much about getting dragged into federal court over it.

  • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @12:33PM

    by Anonymous Coward on Tuesday July 05 2016, @12:33PM (#370023)

    Personally I think that if a site is returning different content depending on the user agent, it's a good idea to make a survey of which pages are affected by requesting the contents of the entire site with every possible user agent in existence. Because this is obviously slow, it can be done in parallel (different IP addresses also come in handy if the number of connections is limited). I also recommend redoing the check periodically to see whether the situation has changed.

    It would be great if browsers had this functionality too. They could do it in the background silently, and then the results would already be available when you pull out the developer sidebar.
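
    The per-page part of such a survey might look like this minimal sketch (Python requests assumed; the agent strings and URL are placeholders):

        import hashlib
        import requests

        AGENTS = [
            "Mozilla/5.0 (Windows NT 10.0; rv:47.0) Gecko/20100101 Firefox/47.0",
            "Googlebot/2.1 (+http://www.google.com/bot.html)",
            "curl/7.47.0",
        ]

        def survey(url):
            """Fetch the same URL under each agent and return a digest of each response body."""
            digests = {}
            for ua in AGENTS:
                body = requests.get(url, headers={"User-Agent": ua}, timeout=10).content
                digests[ua] = hashlib.sha256(body).hexdigest()
            return digests

        # Note: dynamic pages (timestamps, session tokens) will differ even for the same UA,
        # so a real survey would need something smarter than exact hashes.
        results = survey("https://example.com/")
        print("content varies by UA" if len(set(results.values())) > 1 else "same for all agents")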

  • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @01:08PM

    by Anonymous Coward on Tuesday July 05 2016, @01:08PM (#370034)

    Your stealth crawling is not important to my website.
    Over 50k IPs (mostly AWS) blocked for faking the user agent and not respecting robots.txt.
    Claim to be Google from AWS -> permaban. Simple as that, spammer.

    • (Score: 1, Interesting) by Anonymous Coward on Tuesday July 05 2016, @01:53PM

      by Anonymous Coward on Tuesday July 05 2016, @01:53PM (#370057)

      You most likely blocked ~10K legitimate techy users. Good job.

      With advice like this floating around nowadays, http://www.ghacks.net/2016/02/26/read-articles-behind-paywalls-by-masquerading-as-googlebot/ [ghacks.net] *grins*

      • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @02:04PM

        by Anonymous Coward on Tuesday July 05 2016, @02:04PM (#370072)

        Most likely? Do you even know what kind of site they run? With so little information, how have you determined the likelihood of such a thing?

        • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @02:26PM

          by Anonymous Coward on Tuesday July 05 2016, @02:26PM (#370083)

          Well, let's see. If it's a website that expects to be indexed by search engines at all, then it'll probably be public-facing, or at the very least behind an HTTP login prompt. Being worried about spammers means there's probably some kind of form or other means of submitting things to the site. Both of those hint that random users may stumble upon the site and get permabanned for pretty much no reason whatsoever and without warning.

      • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @02:29PM

        by Anonymous Coward on Tuesday July 05 2016, @02:29PM (#370085)

        "Users" that request 100 pages per day over and over again, dozens per minute, with alternating UA, never requesting an image, css or javascript. Right.

        It is your problem if you willingly use bad hosters / IP blocks that do not care about malicious clients, like AWS.

    • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @02:29PM

      by Anonymous Coward on Tuesday July 05 2016, @02:29PM (#370086)

      Your stealth crawling is not important to my website.

      And your website is not important to my stealth crawling.

      When nobody can find your web site because you have blocked the latest hot new search tool... well, too bad, you don't exist anymore.

      • (Score: -1, Flamebait) by Anonymous Coward on Tuesday July 05 2016, @02:33PM

        by Anonymous Coward on Tuesday July 05 2016, @02:33PM (#370087)

        Your hot new search tool that needs to lie about itself is not important to anyone.

        • (Score: 1, Touché) by Anonymous Coward on Wednesday July 06 2016, @04:26AM

          by Anonymous Coward on Wednesday July 06 2016, @04:26AM (#370472)

          I'm sorry, I can't hear you. You seem to be busy not existing. :P

  • (Score: 3, Insightful) by ledow on Tuesday July 05 2016, @01:11PM

    by ledow (5567) on Tuesday July 05 2016, @01:11PM (#370035) Homepage

    It's untrusted data.

    Any program running on a server relying on that data to be accurate is basically in error.

    However, unless you have a REASON to fake it, I don't see why you should. If you can't "see" something with, say, a Chrome UA, why should you bother to index it at all? Just let it die out on its own while it only allows curl or whatever.

    That said, robots.txt is a nonsense too. You really think that it's going to stop your stuff being found, especially if you end up having to list "what not to look at".

    As far as I'm concerned a fake UA isn't a problem. But if we start going down the road of "every browser has a uniquely random UA", then we'd be being more honest and accurate ("Don't try and guess my browser, just give me the HTML page I asked for"), though at the same time browser usage stats for websites would become useless. I can't say I'd miss them, but we'd never again be able to tell if Edge is beating out Chrome or whatever. Not that anything along those lines is even vaguely accurate anyway, precisely because the UA can just be made up or cloned.

    • (Score: 4, Interesting) by Bill Dimm on Tuesday July 05 2016, @02:00PM

      by Bill Dimm (940) on Tuesday July 05 2016, @02:00PM (#370069)

      That said, robots.txt is a nonsense too. You really think that it's going to stop your stuff being found, especially if you end up having to list "what not to look at".

      No, I don't think it will stop the content from being found, but it will tell a well-behaved bot that it would be stupid to try to index it. If they insist on trying to index it anyway, what will stop them is when I ban their IP address from accessing the site at all. There are plenty of legitimate reasons for indicating that some pages shouldn't be indexed -- the content may be temporary, or it may be highly redundant (e.g., a list of links to similar articles). If your bot decides to plow through it anyway, despite the site owner who understands the content much better than your bot going to the trouble to tell your bot what is useful to index and what isn't, you are just demonstrating that your bot isn't worth tolerating at all.

    • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @08:43PM

      by Anonymous Coward on Tuesday July 05 2016, @08:43PM (#370246)

      why should you bother?

      Oh, man. You've lost your Indiana Jones mojo.

      Just let it die out

      What if it's the last place on the planet containing the answer to life, the universe, and everything?

      Don't try and guess my browser, just give me the HTML page I asked for

      Now, based on that platform, anytime you choose to run, you've got my vote for President of the Internet.

      -- OriginalOwner_ [soylentnews.org]

  • (Score: 4, Insightful) by gman003 on Tuesday July 05 2016, @01:56PM

    by gman003 (4155) on Tuesday July 05 2016, @01:56PM (#370061)

    I can see arguments against this, but I'd go with "findxbot/1.0 (like Googlebot)". This is an accepted pattern for user agent strings for browsers (most Webkit browsers contain "KHTML, like Gecko" in their layout engine descriptor, for example), and it is compatible with simple string checks but still allows your bot to be specifically excluded if desired (someone wanting to distinguish you from actual Googlebot just needs to check for "findxbot").

    The argument against this, of course, is that some people have decided to auto-blacklist anyone coming from a non-Google IP with a Googlebot agent. If they're doing it by just checking for the string "Googlebot", you'll get blacklisted. But I would be willing to call that behavior wrong, and live with occasional blacklisting (I do not expect this to be common behavior).
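
    The substring behaviour is easy to see (a tiny Python illustration):

        ua = "findxbot/1.0 (like Googlebot)"
        print("Googlebot" in ua)   # True: a naive substring check treats this as Googlebot
        print("findxbot" in ua)    # True: but the bot can still be singled out by name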

    • (Score: 1) by isj on Tuesday July 05 2016, @03:02PM

      by isj (5249) on Tuesday July 05 2016, @03:02PM (#370098) Homepage

      We currently use the user-agent string "Mozilla/5.0 (compatible; Findxbot/1.0; +http://www.findxbot.com)", and the entry we check in robots.txt is for "findxbot".

      It would be risky to mention any other bot's name in the user-agent because if that bot misbehaves then we would get hit by angry webmasters too.

    • (Score: 2) by butthurt on Tuesday July 05 2016, @06:46PM

      by butthurt (6141) on Tuesday July 05 2016, @06:46PM (#370197) Journal

      That behaviour is fairly common, I think. When I used the Googlebot user-agent for interactive browsing I encountered it.

  • (Score: 5, Informative) by GungnirSniper on Tuesday July 05 2016, @01:57PM

    by GungnirSniper (1671) on Tuesday July 05 2016, @01:57PM (#370064) Journal

    The worst abuse of robots.txt isn't by generic crawlers; it's that the Internet Archive retroactively applies domain-wide disallows to the entire history of a domain. This means sites that were once available there can disappear if the domain name registration expires and is picked up by a speculator or squatter who uses a stricter robots.txt. I've seen technical documentation disappear this way and it is disheartening.

    • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @02:06PM

      by Anonymous Coward on Tuesday July 05 2016, @02:06PM (#370074)

      I thought it was supposed to be an archive? Not much of an archive if they just delete the history of an entire site based on some arbitrary nonsense.

    • (Score: 3, Insightful) by SomeGuy on Tuesday July 05 2016, @02:18PM

      by SomeGuy (5632) on Tuesday July 05 2016, @02:18PM (#370077)

      So they still haven't fixed this?

      From an archival standpoint, there is little difference between a changed robots.txt and a page that has simply been removed.

      Why make one retroactive but not the other?

      If there is content that the site author did not want archived... well they should not have put it on the web in the first place but besides that... they should directly request removal in the same way.

      • (Score: 2) by Scruffy Beard 2 on Tuesday July 05 2016, @03:46PM

        by Scruffy Beard 2 (6030) on Tuesday July 05 2016, @03:46PM (#370119)

        My understanding is that they treat it like a take-down request.

        They have a model of putting information out there and taking it down if a copyright holder complains.

        • (Score: 2) by GungnirSniper on Tuesday July 05 2016, @04:38PM

          by GungnirSniper (1671) on Tuesday July 05 2016, @04:38PM (#370146) Journal

          Surely there is a better way than an eternal takedown, because eventually nearly every domain is going to change hands. Archived sites that had great info years ago shouldn't just disappear forever because Sedo or GoDaddy or someone icky gets the registration. It would be like a reused ISBN cancelling the copyright on the prior published work.

          • (Score: 1, Informative) by Anonymous Coward on Tuesday July 05 2016, @05:53PM

            by Anonymous Coward on Tuesday July 05 2016, @05:53PM (#370172)

            Their takedown isn't permanent. Every bit of data they collect is still there, and some sites' archives are still downloadable to the public as WARC files. Their policy is to take down access through the Wayback Machine until the robots.txt disappears or copyright expires.

            • (Score: 1) by Chrontius on Wednesday July 06 2016, @11:47PM

              by Chrontius (5246) on Wednesday July 06 2016, @11:47PM (#371033)

              I've tried working with WARC files, but I'm still hazy on how to browse them. Do you have any guides you could point me to?

        • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @08:49PM

          by Anonymous Coward on Tuesday July 05 2016, @08:49PM (#370248)

          if a copyright holder complains

          NO. That's not what's being complained about.

          So, on the one hand, we have a guy who hosts websites on his servers and owns all those hosted domains.
          Let's call him Hostmaster.

          On the other hand, we have a guy who rents one of those domains and generates all the **content** that appears on that website.
          Let's call him Webmaster.
          Webmaster is just fine with his content being archived.

          Now, something changes (e.g. missed hosting payments) and Webmaster loses control of his domain|subdomain.

          In the robots.txt for the domain formerly used by Webmaster, Hostmaster specifies that all that content is inaccessible.

          archive.org GOES BACK IN TIME to the point where they already had permission to archive that content and had *done* so.
          They now treat that content on archive.org's own servers as if it NEVER EXISTED.

          .
          In a slight plot twist, if Webmaster actually *owned* his domain and let his registration lapse, and Hostmaster subsequently snatched up that URL, we'd be at the same point.

          This stuff is a problem of Capitalism|ownership|rent-seeking.
          The logical argument to be made (which you are missing) is that the Intellectual Property still belongs to the (former) Webmaster who is still its creator.

          ...as well as the fact that Hostmaster is being a dick and that archive.org is siding with the dick.

          -- OriginalOwner_ [soylentnews.org]

          • (Score: 2) by Scruffy Beard 2 on Wednesday July 06 2016, @05:03PM

            by Scruffy Beard 2 (6030) on Wednesday July 06 2016, @05:03PM (#370770)

            You missed some nuance in my post. Maybe I was not clear.

            I said they treat it "like" a take-down request.

            I am aware that the domain holder may not be the copyright holder in many (most?) cases.

    • (Score: 3, Interesting) by bradley13 on Tuesday July 05 2016, @03:53PM

      by bradley13 (3053) Subscriber Badge on Tuesday July 05 2016, @03:53PM (#370126) Homepage Journal

      Mixed feelings about Archive.org. I once was very happy to have this behavior.

      Getty Images once threatened to sue our micro-company over images we had on a website. We had purchased these from a smaller site that Getty bought; after Getty bought them, they apparently discarded the sales records. They were unimpressed by our physical receipts, because these didn't map directly to the image numbers.

      We could fight them in court, or we could pay them off for only $x thousand, a figure they set to be less than initial legal costs would have been. It's sort of like the ransomware out there, only they abuse the legal system instead of cryptography. Their timing was also great, just before Christmas, when they bloody well knew that most people didn't want to deal with crap.

      Anyhow, back to Archive.org: I was glad at the time to be able to take down not only the images, but also all copies at Archive.org, just to prevent any potential repeat of the idiocy. At the same time, this is a shame, as it means that Archive.org fails to be a true archive. It ought to show what was available at any particular point in time, regardless of later changes.

      Search for "getty images extortion" - they apparently play this game a lot. They also continue to buy other, smaller image sites - it's increasingly difficult to avoid them.

      --
      Everyone is somebody else's weirdo.
  • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @06:12PM

    by Anonymous Coward on Tuesday July 05 2016, @06:12PM (#370180)

    First example is a web site that checks the user-agent in the http-request and returns a "your browser is not supported" - even for robots.txt. Another example is a site that had an explicit whitelist in robots.txt.

    1: Spoof a current version of a common browser (so, Firefox).
    2: Go in if wget or curl are allowed in robots.txt; respect it to the letter otherwise.
    3: In particular, don't go in if it only allows Googlebot.
    4: If you want, append your bot's name to the end.

    The rationale for these is:
    1: Poorly configured sites are increasingly common; being "Firefox" lets you get at the site as it was intended to be displayed.
    2: wget and curl can scrape a whole site; if a site allows them at all, there isn't a rationale for other bots being blocked (particularly since quite a few bots use wget/curl to do their work, so their UA string will be wget's or curl's).
    3: If a site only allows Googlebot, they'll probably block anything else scraping, so just go with what it says.
    4: This may or may not override 1 -- some misconfigured sites will probably balk at the addition, but it allows you to be specifically whitelisted/blacklisted.
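
    One way those rules might look in code (a sketch only; Python with urllib.robotparser assumed, and "mybot" is a placeholder name):

        from urllib.robotparser import RobotFileParser

        FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"

        def pick_user_agent(site, path):
            rp = RobotFileParser(site + "/robots.txt")
            rp.read()
            # Rule 2: wget or curl being allowed is taken as "generic scraping is fine";
            # rules 1 and 4: spoof a common browser, with the bot's own name appended.
            if rp.can_fetch("wget", site + path) or rp.can_fetch("curl", site + path):
                return FIREFOX_UA + " mybot/1.0"
            # Otherwise respect robots.txt to the letter for our own name.
            # Rule 3: a googlebot-only whitelist therefore means "stay out".
            if rp.can_fetch("mybot", site + path):
                return FIREFOX_UA + " mybot/1.0"
            return None   # disallowed: skip this page

        print(pick_user_agent("https://example.com", "/"))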

  • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @06:23PM

    by Anonymous Coward on Tuesday July 05 2016, @06:23PM (#370186)

    That's pretty basic. It's interesting how many people disagree.

  • (Score: 2) by DrkShadow on Tuesday July 05 2016, @10:06PM

    by DrkShadow (1404) on Tuesday July 05 2016, @10:06PM (#370316)

    There was one crawler that hit one of our sites so hard and fast that it exhausted the permitted number of database connections, filled the RAM, and brought the site down. This happened three times before I figured out what was going on. (That isn't my primary focus, but I now have a very specific method of looking through performance issues on that site.)

    Looking into it, I honestly don't care if you fake a user agent. I can tell _which_ user-agents you're crawling with by subnet. What I would like you to do _first_ is crawl the site (or even just visit the home page) with your real user agent (so I know what that subnet is associated with) and then, preferably on the same UTC day, crawl it again with whatever the hell you want.

    robots.txt? ALWAYS RESPECT THAT. It is the SOLE way a site has of communicating with web bots. (I'm not going to sign up with brand-new XYZ marketing or any other crawler.) If you ignore it, I use the firewall to block access from your subnet and any subnet associated with you. If a site whitelists certain crawlers and not yours, DO NOT crawl that site. If you represent yourself as Googlebot and your IP whois info isn't registered to Google, g'bye.

    In general, I just want the site to stay up. If you can do things without causing problems or excessive bandwidth costs, I'm not going to care what you do. If I put something in robots.txt about you, it means I've noticed you and I'm a hair's breadth from banning your bot, domain, IP range, and anything else I can find out via whois, Google, user-agent matching (employee in the office?), or otherwise.
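
    From the crawler side, the "keep the site up" part mostly comes down to capping concurrency and pacing requests per host, something like this sketch (Python requests assumed; the limits are purely illustrative):

        import threading
        import time
        import requests

        PER_HOST_LIMIT = threading.Semaphore(2)   # at most 2 in-flight requests to one host
        MIN_INTERVAL = 1.0                        # and at least a second between them

        def polite_get(url):
            with PER_HOST_LIMIT:
                resp = requests.get(url, headers={"User-Agent": "Examplebot/1.0"}, timeout=30)
                time.sleep(MIN_INTERVAL)
                return resp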

  • (Score: 4, Funny) by shortscreen on Wednesday July 06 2016, @02:18AM

    by shortscreen (2252) on Wednesday July 06 2016, @02:18AM (#370422) Journal

    I love all the comments on this story touting their own arbitrary policies along with tricks for getting around everyone else's arbitrary policies. Hurray for web standards.

  • (Score: 0) by Anonymous Coward on Wednesday July 06 2016, @10:18PM

    by Anonymous Coward on Wednesday July 06 2016, @10:18PM (#370996)

    IMHO, sending or reacting to the user-agent should be forbidden.

    Anyone making a webserver react to or log user agent strings, or a web browser send one by default, SHOULD get prison for five years or so.

    (Due to the current sad state the web is in, browsers still have to make it easy for the user to set whatever he wants for individual fucked-up websites, but let's hope that won't be needed anymore in 10 years or so. For now, I suppose you can program your crawler to detect whether a server writes "browser is not supported" or the like in the resulting page and then retry with a couple of different user-agent strings for that website. Never send a fake one to normal websites, though. Perhaps your search site can show a warning on search result links for sites that require a certain user-agent?)
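
    That retry idea could look roughly like this (Python requests assumed; the fallback list and the "not supported" marker are placeholders):

        import requests

        FALLBACK_UAS = [
            "Examplebot/1.0 (+http://example.com/bot)",                              # honest UA first
            "Mozilla/5.0 (X11; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0",  # browser UA as a last resort
        ]

        def fetch_with_fallback(url):
            for ua in FALLBACK_UAS:
                resp = requests.get(url, headers={"User-Agent": ua}, timeout=10)
                if "browser is not supported" not in resp.text.lower():
                    return resp               # looks like real content
            return None                       # give up rather than keep spoofing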