posted by martyb on Tuesday July 05 2016, @10:51AM
from the what-would-you-do? dept.

Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.

Question: Should a web crawler always reveal its true name?

Background: While crawling the web I've run into situations where using a fake user-agent might help. One example is a web site that checks the user-agent in the HTTP request and returns a "your browser is not supported" page, even for robots.txt. Another is a site whose robots.txt contained an explicit whitelist; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent (e.g. Googlebot's) because it isn't clear what these clueless webmasters actually intend. It appears that some websites are so misconfigured or so Google-optimized that other or newer search engines may have to resort to faking the user-agent.
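For illustration, here is a minimal sketch (in Python, with a placeholder user-agent string and example URLs) of what I mean by fetching robots.txt under the crawler's real identity and honouring it before requesting a page:

    # Sketch only: fetch robots.txt with an explicit User-Agent and check it
    # before crawling a page. The user-agent string and URLs are placeholders.
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "examplebot/0.1 (+https://example.org/bot-info)"  # placeholder

    def allowed_to_fetch(page_url, robots_url):
        # Fetch robots.txt with our real User-Agent so the site sees who is asking.
        req = urllib.request.Request(robots_url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")

        # Parse the rules and ask whether our agent token may fetch the page.
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(body.splitlines())
        return parser.can_fetch(USER_AGENT, page_url)

    if __name__ == "__main__":
        print(allowed_to_fetch("https://example.com/page",
                               "https://example.com/robots.txt"))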

I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which clearly shows up when I search on Qwant) has never been crawled by a user-agent resembling anything that could be traced back to Qwant. They apparently don't reveal what their user-agent is (https://blog.qwant.com/qwant-fr/), and there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm

This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex, Startpage uses Google, etc.).

So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?


  • (Score: 2) by DrkShadow (1404) on Tuesday July 05 2016, @10:06PM (#370316)

    There was one crawler that hit one of our sites so hard and fast that it exhausted the permitted number of database connections, filled the RAM, and brought the site down. This happened three times before I figured out what was going on. (That isn't my primary focus, but I now have a very specific method for tracking down performance issues on that site.)

    Having looked into it, I honestly don't care if you fake a user-agent. I can tell which crawler is which by subnet, whatever user-agent you send. What I would like you to do _first_ is crawl the site (or even just visit the home page) with your real user-agent (so I know which subnet it's associated with) and then, preferably on the same UTC day, crawl it again with whatever the hell you want.

    robots.txt? ALWAYS RESPECT THAT. It is the SOLE way site operators have of communicating with web bots. (I'm not going to sign up with brand-new XYZ marketing or any other crawler.) If you ignore it, I use the firewall to block your subnet and any subnet associated with you. If a site whitelists certain crawlers and not yours, DO NOT crawl that site. If you present yourself as Googlebot and your IP's whois info isn't registered to Google, g'bye.
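    That Googlebot check is the usual reverse-then-forward DNS verification; a rough sketch in Python (the sample IP is only an illustration):

        # Sketch: verify a claimed Googlebot by reverse (PTR) lookup, then confirm
        # the hostname resolves back to the same IP (forward confirmation).
        import socket

        def is_really_googlebot(ip):
            try:
                hostname, _, _ = socket.gethostbyaddr(ip)        # reverse (PTR) lookup
            except OSError:
                return False
            if not hostname.endswith((".googlebot.com", ".google.com")):
                return False
            try:
                _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward lookup
            except OSError:
                return False
            return ip in forward_ips                             # must round-trip

        if __name__ == "__main__":
            print(is_really_googlebot("66.249.66.1"))  # sample address, illustration only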

    In general, I just want the site to stay up. If you can do your thing without causing problems and without causing excessive bandwidth costs, I'm not going to care what you do. If I put something in robots.txt about you specifically, it means I've noticed you and I'm a hair's breadth from banning your bot, domain, IP range, and anything else I can find out via whois, Google, user-agent matching (an employee in the office?), or otherwise.
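    (Concretely, "not causing problems" mostly means pacing your requests. A tiny sketch of per-host throttling; the delay value is just an assumption:)

        # Sketch: never hit the same host more often than once every few seconds.
        import time
        import urllib.parse

        CRAWL_DELAY = 5.0   # seconds between requests to one host (assumed value)
        last_hit = {}       # host -> time of our last request

        def wait_politely(url):
            host = urllib.parse.urlsplit(url).hostname
            wait = last_hit.get(host, 0.0) + CRAWL_DELAY - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            last_hit[host] = time.monotonic()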
