posted by martyb on Tuesday July 05 2016, @10:51AM   Printer-friendly
from the what-would-you-do? dept.

Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.

Question: Should a web crawler always reveal its true name?

Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a web site that checks the user-agent in the HTTP request and returns "your browser is not supported" - even for robots.txt. Another example is a site that had an explicit whitelist in robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. googlebot, because it isn't clear what the clueless webmasters' intentions are. It appears that some websites are so misconfigured or so Google-optimized that other/new search engines may have to resort to faking the user-agent.
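As an illustration, here is a minimal Python sketch of checking what a robots.txt whitelist actually permits for different agent tokens. The site URL and the "findxbot" token are placeholders, not the crawler's real name:

    from urllib import request, robotparser

    ROBOTS_URL = "https://example.com/robots.txt"   # placeholder site
    UA = "findxbot"                                  # placeholder crawler name

    # Fetch robots.txt ourselves so we control the User-Agent header;
    # RobotFileParser.read() would send Python's default urllib agent,
    # which is exactly the kind of string some sites reject.
    req = request.Request(ROBOTS_URL, headers={"User-Agent": UA})
    body = request.urlopen(req, timeout=10).read().decode("utf-8", "replace")

    rp = robotparser.RobotFileParser()
    rp.parse(body.splitlines())

    # See which agent tokens the whitelist actually lets in.
    for agent in ("findxbot", "Googlebot", "curl", "wget"):
        print(agent, rp.can_fetch(agent, "https://example.com/some/page"))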

I'm also puzzled by Qwant because they claim to have their own search index, but my personal website (which is clearly indexed when I search on Qwant) has never been crawled by a user-agent resembling anything that could lead back to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm

This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.).

So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?


Original Submission

 
  • (Score: 5, Interesting) by Unixnut on Tuesday July 05 2016, @11:10AM

    by Unixnut (5779) on Tuesday July 05 2016, @11:10AM (#370010)

    I am also working on a search engine (although it is a reverse image search engine, and still in the research phase, so not publicly available), and when spidering I use the googlebot user agent. This is because quite a few sites and applications have become so Google-centric that they don't know (or care) about alternatives. As a result they will let through human user agents and googlebot, but assume the rest are scrapers/bots and reject them.

    Even worse, some of them use intrusion detection systems which, if they don't see your bot in their whitelist, will assume you are nefarious and trying to scrape the site, and block the IP for some time. As I only have one static IP, and quite a few sites are hosted by the same companies behind the same IDS, this can quickly result in me being denied access to a lot of places.

    So for the time being I spoof googlebot and follow the same rules it does. A bit like how, in the dark days of Microsoft, people would spoof IE user-agents to be able to view websites. At least until you become well known enough that people know your bot is OK to whitelist (and perhaps one day will encourage its arrival).
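    For reference, spoofing the agent string is just a matter of setting one header. A minimal Python sketch, with example.com standing in for a real site; the UA string is the one Googlebot publishes:

        import urllib.request

        # Present a Googlebot-style User-Agent instead of the crawler's own
        # name. Whether doing this is acceptable is what the thread is about.
        GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                        "+http://www.google.com/bot.html)")

        req = urllib.request.Request("https://example.com/",
                                     headers={"User-Agent": GOOGLEBOT_UA})
        html = urllib.request.urlopen(req, timeout=10).read()
        print(len(html), "bytes fetched")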

  • (Score: 5, Insightful) by datapharmer on Tuesday July 05 2016, @11:29AM

    by datapharmer (2702) on Tuesday July 05 2016, @11:29AM (#370013)

    And some of us use a reverse lookup on all bot traffic, and if a claimed googlebot doesn't resolve back to a Google IP it gets redirected to a honeypot for bad bots. No offense, but this is bad advice - you are missing out on more of the web than you might realize by doing this.
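    The check described here is the documented way to verify Googlebot: reverse-resolve the IP, require a googlebot.com or google.com hostname, then forward-resolve that name and confirm it maps back to the same IP. A minimal Python sketch; the sample IP is just a commonly seen Googlebot address:

        import socket

        def is_real_googlebot(ip):
            """Reverse-then-forward DNS check for a claimed Googlebot."""
            try:
                host, _, _ = socket.gethostbyaddr(ip)           # PTR lookup
            except socket.herror:
                return False
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            try:
                forward_ips = socket.gethostbyname_ex(host)[2]  # A records
            except socket.gaierror:
                return False
            return ip in forward_ips                            # must round-trip

        # Anything sending a Googlebot UA that fails this check can be
        # treated as a bad bot (e.g. sent to the honeypot).
        print(is_real_googlebot("66.249.66.1"))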

    • (Score: 5, Insightful) by GungnirSniper on Tuesday July 05 2016, @01:01PM

      by GungnirSniper (1671) on Tuesday July 05 2016, @01:01PM (#370031) Journal

      So what does he do when using a blank User-Agent leads to endless loops, and non-standard UAs get banned or blocked?

      Aren't you missing out on potential readers when redirecting traffic to a honeypot and reinforcing Google's absolute dominance in the search space?

      • (Score: 0) by Anonymous Coward on Tuesday July 05 2016, @01:14PM

        by Anonymous Coward on Tuesday July 05 2016, @01:14PM (#370038)

        Ignore the misconfigured website. It is not important to your insignificant crawler.
        And the other way around: your insignificant spam crawler that fakes its UA is not important to my insignificant website either.

      • (Score: 1, Interesting) by Anonymous Coward on Tuesday July 05 2016, @01:32PM

        by Anonymous Coward on Tuesday July 05 2016, @01:32PM (#370048)

        > So what does he do when using a blank User Agent gets endless loops, and non-standard UAs get banned or blocked?

        Crawl the site multiple times with different UAs. Don't do it back-to-back, or at least not from the same IP address.

      • (Score: 2) by datapharmer on Thursday July 07 2016, @02:54PM

        by datapharmer (2702) on Thursday July 07 2016, @02:54PM (#371270)

        There is no penalty for not sharing, only for lying.

    • (Score: 2) by Unixnut on Tuesday July 05 2016, @01:28PM

      by Unixnut (5779) on Tuesday July 05 2016, @01:28PM (#370047)

      Thankfully I have not come across this yet. I think only once did the bot end up stuck in a bit of a loop, but that was more down to my poor coding not checking for infinite recursion than anything else - a bug I have since corrected.

      If a site really doesn't want to be spidered that badly, to the point where they redirect to honeypots, then I won't push it further. Using the googlebot user agent still works better than a non-googlebot user agent. So for example, if my hit rate goes from 60% to 95% (minus the 5% who redirect to honeypots, because let's face it, few people are quite that paranoid), that is still an improvement.

      On a related note, is there some sort of open source bot that I can use/modify, or does everyone just write their own bot, in some awful "let's reinvent the wheel a few hundred times" way? I know there is some example code out there (indeed, that is what I used initially for mine), but no actual proper project with active development that I can find.

      • (Score: 3, Insightful) by TheRaven on Tuesday July 05 2016, @01:32PM

        by TheRaven (270) on Tuesday July 05 2016, @01:32PM (#370049) Journal
        Try with a few different user agents from a few different IPs. Penalise sites that serve different content to them.
        --
        sudo mod me up
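        A minimal Python sketch of that idea, comparing response hashes across a few User-Agent strings. The URL and the made-up bot name are placeholders, and dynamic pages would need content normalisation before comparing:

            import hashlib
            import urllib.request

            AGENTS = [
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",     # browser-like
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
                "examplebot/0.1 (+https://example.com/bot)",     # honest, made-up bot name
            ]

            def fingerprint(url, ua):
                req = urllib.request.Request(url, headers={"User-Agent": ua})
                body = urllib.request.urlopen(req, timeout=10).read()
                return hashlib.sha256(body).hexdigest()

            url = "https://example.com/"
            hashes = {ua: fingerprint(url, ua) for ua in AGENTS}
            if len(set(hashes.values())) > 1:
                print("site serves different content to different agents")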
        • (Score: 2) by Unixnut on Tuesday July 05 2016, @02:22PM

          by Unixnut (5779) on Tuesday July 05 2016, @02:22PM (#370079)

          That would work if I had multiple IPs (I don't), although based on the comments so far here, I may well alter it to randomly select a web browser user agent, and hope that no IDSes out there use pattern matching to see if my "browser" is actually behaving more like a bot than a browser.

          Really quite a bit of faff for what used to be really quite simple: crawling web pages.
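          A sketch of that workaround in Python: pick a browser-like User-Agent at random for each request and pace the requests so the pattern looks less bot-like. The agent strings and delays are illustrative only:

              import random
              import time
              import urllib.request

              BROWSER_AGENTS = [
                  "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
                  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36",
              ]

              def polite_fetch(url):
                  ua = random.choice(BROWSER_AGENTS)            # rotate the "browser"
                  req = urllib.request.Request(url, headers={"User-Agent": ua})
                  body = urllib.request.urlopen(req, timeout=10).read()
                  time.sleep(random.uniform(2.0, 8.0))          # space out requests
                  return body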

          • (Score: 1, Insightful) by Anonymous Coward on Tuesday July 05 2016, @03:50PM

            by Anonymous Coward on Tuesday July 05 2016, @03:50PM (#370122)

            You could try getting a VPN; they're not that expensive, and you can usually choose between a few dozen different servers, which gives you a lot of different IP addresses to use.

            • (Score: 1) by toddestan on Thursday July 07 2016, @02:51AM

              by toddestan (4982) on Thursday July 07 2016, @02:51AM (#371102)

              A bunch of sites will redirect you to Google's obnoxious reCAPTCHA if they think you're coming from a VPN, because they assume you're a bot or up to some other nefarious purpose. You really can't win.

      • (Score: 1) by isj on Tuesday July 05 2016, @05:56PM

        by isj (5249) on Tuesday July 05 2016, @05:56PM (#370174) Homepage

        > On a related note, is there some sort of open source bot that I can use/modify, or does everyone just write their own bot, in some awful "let's reinvent the wheel a few hundred times" way? I know there is some example code out there (indeed, that is what I used initially for mine), but no actual proper project with active development that I can find.

        We're using a fork of https://github.com/gigablast/open-source-search-engine/ [github.com]. The code (C-style C++) is complex and large, but it offers some features that are hard to find in other projects.
        Crawling: I'm not aware of any projects specializing in that, but there must be some simple ones out there based on curl/wget and a bolted-on scheduler.
        Indexing and searching: If what you intend to index will be relatively uniform, comparable, and spam-free, and word order does not matter to you, then any of the engines supporting BM25 are faster and simpler.
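        Along the lines of the "curl/wget plus a bolted-on scheduler" idea, here is a bare-bones breadth-first crawler sketch in Python. The seed URL and bot name are placeholders, only absolute href links are followed, and a real crawler would also honour robots.txt and rate limits; the visited set is what prevents the infinite-recursion problem mentioned earlier in the thread:

            import collections
            import re
            import urllib.request

            SEED = "https://example.com/"                          # placeholder seed
            UA = "examplebot/0.1 (+https://example.com/bot)"       # placeholder bot name
            LINK_RE = re.compile(rb'href="(https?://[^"#]+)"')     # absolute links only

            def crawl(seed, limit=50):
                queue = collections.deque([seed])
                seen = {seed}                                      # visited set
                while queue and len(seen) < limit:
                    url = queue.popleft()
                    try:
                        req = urllib.request.Request(url, headers={"User-Agent": UA})
                        body = urllib.request.urlopen(req, timeout=10).read()
                    except Exception:
                        continue                                   # unreachable page: skip
                    for raw in LINK_RE.findall(body):
                        link = raw.decode("ascii", "ignore")
                        if link not in seen:                       # never revisit a URL
                            seen.add(link)
                            queue.append(link)
                return seen

            print(len(crawl(SEED)), "URLs discovered")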