posted by martyb on Tuesday July 05 2016, @10:51AM
from the what-would-you-do? dept.

Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.

Question: Should a web crawler always reveal its true name?

Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" page -- even for robots.txt. Another example is a site that had an explicit whitelist in robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. googlebot, because it isn't clear what the clueless webmasters' intentions are. It appears that some websites are so misconfigured or so Google-optimized that other/new search engines may have to resort to faking the user-agent.
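
To make the first example concrete, here's roughly the kind of probe I end up running: fetch robots.txt once with the crawler's own user-agent and once with a browser-like string, then compare what comes back. (The bot identity and site below are placeholders, not findx's real crawler; this is just a Python sketch.)

    import urllib.error
    import urllib.request
    import urllib.robotparser

    # Placeholder identities -- substitute the crawler's real name/contact URL
    # and whatever browser-like string you want to compare against.
    BOT_UA = "examplebot/0.1 (+https://example.com/bot)"
    BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"

    def fetch_robots(site, user_agent):
        """Fetch robots.txt with a given User-Agent; return (status, body)."""
        req = urllib.request.Request(site + "/robots.txt",
                                     headers={"User-Agent": user_agent})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status, resp.read().decode("utf-8", "replace")
        except urllib.error.HTTPError as err:
            return err.code, ""

    site = "https://example.com"  # site under test (placeholder)
    for ua in (BOT_UA, BROWSER_UA):
        status, body = fetch_robots(site, ua)
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(body.splitlines())
        print(ua, "->", status, "| may fetch / :", parser.can_fetch(ua, site + "/"))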

I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which clearly shows up when I search on Qwant) has never been crawled by a user-agent resembling anything that could belong to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm

This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.).

So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?


Original Submission

 
  • (Score: 5, Informative) by coolgopher on Tuesday July 05 2016, @11:01AM

    by coolgopher (1157) Subscriber Badge on Tuesday July 05 2016, @11:01AM (#370008)

    Sites relying on the User-Agent header need to DIAF. If your site needs specific feature support only available in some user agents, test for the feature, not the user agent. The user-agent header was always a bad idea, and it became absolutely terrible thanks to IE's lack of standards compliance.

    Unless you have a prior agreement with a service provider that says you must send an accurate user-agent, you really don't need to. Heck, you could simply omit the user-agent, as it's not a required header in RFC 7231.
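
    If you want to see how a given server reacts to a missing User-Agent, Python's http.client is handy, since unlike most HTTP libraries it doesn't add one on its own. Rough sketch; example.com is just a placeholder:

        import http.client

        # http.client adds Host and Accept-Encoding for you, but no User-Agent,
        # so this request goes out without one.
        conn = http.client.HTTPSConnection("example.com", timeout=10)
        conn.request("GET", "/")
        resp = conn.getresponse()
        print(resp.status, resp.reason)
        conn.close()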

  • (Score: 5, Funny) by cockroach on Tuesday July 05 2016, @11:47AM

    by cockroach (2266) on Tuesday July 05 2016, @11:47AM (#370018)

    For a fun time, try setting your user agent to an empty string -- there's a surprising number of pages that will refuse to load or end up in redirect loops.

    • (Score: 1, Informative) by Anonymous Coward on Wednesday July 06 2016, @07:38AM

      by Anonymous Coward on Wednesday July 06 2016, @07:38AM (#370528)

      There's an Apache module that does this - I think it's the one named "mod_security".

      Our monitoring system at work does not send the user-agent header, so we've had to turn the stupid module off a couple of times on the one Apache web server some moron introduced[1]. Apparently the hosting company turns it back on once in a while.

      Oh sure, we could fix the monitoring system. But why, when the alternative is that the moron who absolutely had to have an Apache server is the one suffering for his choice.

      [1] We are five developers, and we already span .NET, iOS, and Android; we don't need any extra platforms.
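
      A rough way to try both versions of the experiment (empty User-Agent versus no User-Agent header at all) against a given site; the URL is a placeholder and the third-party requests library is assumed:

          import requests

          url = "https://example.com/"  # placeholder test target

          for label, headers in (("empty UA", {"User-Agent": ""}),
                                 ("no UA", {"User-Agent": None})):  # None drops the header
              try:
                  # Don't follow redirects automatically, so a redirect loop shows
                  # up as a 3xx status instead of the client spinning until it gives up.
                  resp = requests.get(url, headers=headers,
                                      allow_redirects=False, timeout=10)
                  print(label, "->", resp.status_code,
                        resp.headers.get("Location", ""))
              except requests.RequestException as exc:
                  print(label, "-> request failed:", exc)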

  • (Score: 3, Insightful) by SomeGuy on Tuesday July 05 2016, @01:59PM

    by SomeGuy (5632) on Tuesday July 05 2016, @01:59PM (#370068)

    Sites relying on the User-Agent header need to DIAF.

    This. My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent. When I have contacted web masters about this, all I get is some garbage like "der looks like malware herp derp, use chrome/firefox because security derrrr," which I find completely incredible because there is zero reason why a malicious web client would not spoof the more popular user agents.

    Also, thanks to all the dependence on scripting these days, it seems like the only way to index a page is to load it in each of the big three web browsers (Firefox, Chrome, IE -- I still miss Opera) with a "standard" configuration (no ad blockers, scripting enabled, anus wide open ready for insertion) and then OCR it. Might as well go back to using Flash.

    • (Score: 1, Insightful) by Anonymous Coward on Tuesday July 05 2016, @03:45PM

      by Anonymous Coward on Tuesday July 05 2016, @03:45PM (#370117)

      This. My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent.

      Ah, yes. I use Pale Moon with the Random Agent Spoofer add-on (it's for FF, but it also works just fine in PM). The spoofer randomizes my user agent whenever I restart the browser or click on the icon, to one of, I dunno, probably several hundred different OS/browser combinations, half of which I've never heard of (what the hell is Arora? Omniweb? Uzbl?).

      I'll occasionally get warnings about an unsupported or out-of-date browser, or strange layout problems (Google Maps and Image Search are terrible at this), but manually setting the user agent to some recent version of FF or Chrome invariably fixes the problem.

      What I'm trying to say is, if sites give you crap about your user agent, just fake it and pretend to be a recent version of Firefox or something.

      Bonus: it also helps with tracking, if you don't like websites following your every move. With an "oddball web browser" you're quite easy prey; it's better to either change it periodically or settle on something a lot of people use...

    • (Score: 1, Interesting) by Anonymous Coward on Tuesday July 05 2016, @08:20PM

      by Anonymous Coward on Tuesday July 05 2016, @08:20PM (#370235)

      My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent [string]

      A decade ago, I hung out in the SeaMonkey newsgroup regularly.
      Multiple times a week, we'd see folks reporting that SeaMonkey's (perfectly legit) UA string was getting rejected.
      (Idiot web devs sniffing for "Firefox" instead of "Gecko".)

      In more recent versions, the SeaMonkey devs have simply surrendered to the idiots and have identified SeaMonkey as Firefox by default.

      zero reason

      The phase when the "standard" was perceived to be "Internet Exploder" got things really screwed up.
      If the idiots were going to sniff, the **logical** thing to do would have been to make a page that passed the HTML Validator and deliver **that**.
      ...UNLESS the least-conforming browser (IE) was detected, whereupon a page with a bunch of hacks would have been delivered (a *different* page for each *version* of that not-even-backwards-compatible browser).

      have contacted web masters

      In short, they were mostly sniffing when they didn't need to, then doing the wrong thing with the information they got.
      Again, idiots who don't understand their job.

      In summary, there are a great many people who are putting up websites who have no clue what they're doing and THEY DON'T CARE.
      I see the way to deal with this in the same light as working around a bad boss:
      Just tell them what they want to hear and get on with your life.

      -- OriginalOwner_ [soylentnews.org]

  • (Score: 2) by darkfeline on Tuesday July 05 2016, @03:35PM

    by darkfeline (1030) on Tuesday July 05 2016, @03:35PM (#370112) Homepage

    >Heck, you could simply omit the user-agent as it's not a required header in RFC7231.

    You COULD, but you shouldn't.

    The point of the user agent is to identify your client so bugs can be tracked down to the right source. For example, if I notice a huge torrent of traffic with the user agent "lib version 1.2", I can tell you about it, and then you can go and look and discover a bug where, if settings A and B are set, it gets stuck in an infinite loop.

    The point is NOT to vary the content based on the user agent's supported feature set. HTTP already has a method for content negotiation: the Accept header.

    Nowadays the user agent header is fucking useless since everyone just sends the same few variations of Mozilla/5.0.
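
    To make both points concrete, a well-behaved request could look something like this; the bot name and contact URL are invented, and the third-party requests library is assumed:

        import requests

        headers = {
            # Identifies the client and where to report misbehaviour.
            "User-Agent": "examplebot/1.2 (+https://example.com/bot-info)",
            # Content negotiation belongs here, not in the user agent.
            "Accept": "application/json;q=1.0, text/html;q=0.5",
        }

        resp = requests.get("https://example.com/resource",
                            headers=headers, timeout=10)
        print(resp.status_code, resp.headers.get("Content-Type"))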

    --
    Join the SDF Public Access UNIX System today!
  • (Score: 2) by Justin Case on Wednesday July 06 2016, @01:33AM

    by Justin Case (4239) on Wednesday July 06 2016, @01:33AM (#370401) Journal

    Sites relying on the User-Agent header need to DIAF.

    +1. If your site cares what device or software I'm using, you're doing it wrong and need to be banished to the fiery depths.

    robots.txt is equally silly, meaningless, and there only to waste everybody's time with a "standard" that is a joke.

    Spider it first with a common User-agent, or even be honest that you're a spider. Obey robots.txt.

    Once you have everything you can find that way, hit it again with some lies. Use some throwaway IPs if you can.

    If anything changes, you've found an idiot. Congratulations. They need to feel some pain. Hit them with everything you've got. Do it for England and the Queen.
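
    A sketch of that comparison step: fetch the same URL under two identities and flag it if the bodies differ. The user-agent strings and URL are placeholders, and dynamic content (timestamps, ads) will produce false positives, so treat a mismatch as a hint rather than proof:

        import hashlib
        import requests

        HONEST_UA = "examplebot/0.1 (+https://example.com/bot)"  # placeholder
        BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # placeholder

        def body_hash(url, user_agent):
            """Hash of the response body as fetched with the given User-Agent."""
            resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
            return hashlib.sha256(resp.content).hexdigest()

        url = "https://example.com/some-page"  # placeholder
        if body_hash(url, HONEST_UA) != body_hash(url, BROWSER_UA):
            print("Response differs by user agent -- possible cloaking at", url)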

  • (Score: 2) by Pino P on Thursday July 07 2016, @12:52AM

    by Pino P (4721) on Thursday July 07 2016, @12:52AM (#371069) Journal

    How do you "test for the feature" if the user is refusing to run any scripts (e.g. JS turned off, JS files and elements blocked at proxy, or NoScript or LibreJS extensions)? How do you test for screen size if you want to send a list of article titles to user agents on 4 to 5 inch screens but titles and the first sentence to user agents on tablets or desktops?

    • (Score: 2) by coolgopher on Thursday July 07 2016, @07:20AM

      by coolgopher (1157) Subscriber Badge on Thursday July 07 2016, @07:20AM (#371167)

      A1: If the user has disabled all interactivity, it would seem prudent to have a sane default DOM that can be rendered reasonably well. A blank page doesn't count as sane, unless its name is "white.html", btw.

      A2: I'm not a web developer, but even I have heard of CSS Media Queries; heck I've even used them on occasion. Presumably there are even more tools available. Try looking around the accessibility stuff, that's usually quite enlightening.

      • (Score: 2) by Pino P on Thursday July 07 2016, @03:14PM

        by Pino P (4721) on Thursday July 07 2016, @03:14PM (#371279) Journal

        If the user has disabled all interactivity, it would seem prudent to have a sane default DOM that can be rendered reasonably well.

        Does a link to instructions for troubleshooting problems with execution of JavaScript count as a "sane default DOM"?

        even I have heard of CSS Media Queries

        CSS Media Queries control in what manner the client renders the DOM it receives. They cannot control how much data the client receives. For example, a rule conditioned by a media query can activate display: none on certain element classes when the viewport width is less than a threshold. But it can't prevent the server from sending the HTML for those elements in the first place. Many users on 5-inch screens, perhaps most in my home country (United States), pay per bit for Internet data because metered data plans are standard practice among U.S. cellular carriers. Receiving data that will never be rendered still costs money.
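
        To make that concrete: the trimming has to happen on the server, keyed off something in the request. A minimal sketch assuming the client sends the Save-Data client hint (a real request header, though only some browsers send it); everything else here is invented for illustration:

            from http.server import BaseHTTPRequestHandler, HTTPServer

            FULL = b"<li><b>Title</b> - first sentence of the article...</li>"
            TRIMMED = b"<li><b>Title</b></li>"

            class Handler(BaseHTTPRequestHandler):
                def do_GET(self):
                    # The decision happens here, before any bytes leave the server;
                    # CSS on the client can only hide what has already been sent.
                    body = TRIMMED if self.headers.get("Save-Data") == "on" else FULL
                    self.send_response(200)
                    self.send_header("Content-Type", "text/html; charset=utf-8")
                    self.send_header("Content-Length", str(len(body)))
                    self.end_headers()
                    self.wfile.write(body)

            HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()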

        • (Score: 2) by coolgopher on Friday July 08 2016, @01:48AM

          by coolgopher (1157) Subscriber Badge on Friday July 08 2016, @01:48AM (#371583)

          For your given example, sending or not sending the first sentence would make little difference. The typical* distribution of data usage for a web page these days seems to be, from heaviest to lightest:
          1. Video
          2. Images
          3. Scripts(!)
          4. HTML structure
          5. Textual content.

          Browsing on mobile with images and scripts off saves a disproportionate amount of download traffic. If web masters and web developers could get out of the noxious habit of cramming all manner of junk onto a page that contains only a teensy tiny amount of relevant information, we wouldn't even be having this discussion. Not to mention pulling in script dependencies like it's candy. See e.g. http://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos/. [theregister.co.uk] I *do* agree with you in principle on unrendered elements, don't get me wrong, but I think there are much bigger issues which need far more attention.

          *) I'm sure I read a study on this not that long ago, but I can't seem to find the right search terms now. Links welcome.