
SoylentNews is people

posted by martyb on Tuesday July 05 2016, @10:51AM   Printer-friendly
from the what-would-you-do? dept.

Disclaimer: I work on a search engine (findx). I try not to put competitors in a bad light.

Question: Should a web crawler always reveal its true name?

Background: While crawling the web I've found some situations where using a fake user-agent might help. The first example is a website that checks the user-agent in the HTTP request and returns a "your browser is not supported" error, even for robots.txt. Another example is a site with an explicit whitelist in its robots.txt; strangely, 'curl' was whitelisted but 'wget' was not. I hesitate to use a fake user-agent, e.g. googlebot, because it isn't clear what the clueless webmasters' intentions are. It appears that some websites are so misconfigured or so Google-optimized that other/new search engines may have to resort to faking the user-agent.
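To make the whitelist example concrete, here is a minimal sketch (standard library only) of how a robots.txt like the one described would treat two different user-agents; the robots.txt content below is hypothetical, reconstructed from the description:

```python
# Sketch: a robots.txt whitelist that allows 'curl' but blocks
# everything else, parsed with Python's stdlib robots.txt parser.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt matching the situation described:
# an empty Disallow means "allow everything" for that agent,
# while the catch-all entry blocks the whole site.
robots_txt = """\
User-agent: curl
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("curl", "https://example.com/page"))  # True
print(parser.can_fetch("wget", "https://example.com/page"))  # False
```

A crawler that honestly identifies itself as anything but 'curl' is locked out here, which is exactly the dilemma: obey the (probably unintentional) whitelist, or spoof an allowed agent.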

I'm also puzzled by Qwant: they claim to have their own search index, but my personal website (which clearly shows up when I search on Qwant) has never been crawled by a user-agent resembling anything that could belong to Qwant. Apparently they don't reveal what their user-agent is: https://blog.qwant.com/qwant-fr/. And there has been some discussion about it: https://www.webmasterworld.com/search_engine_spiders/4743502.htm

This is different from search engines that don't have their own index (e.g. DuckDuckGo uses results from Yahoo! and Yandex; Startpage uses Google; etc.).

So what do you Soylentils say: is faking the user-agent in web crawls necessary? Acceptable? A necessary evil?


Original Submission

  • (Score: 1, Insightful) by Anonymous Coward on Tuesday July 05 2016, @03:45PM (#370117)

    This. My oddball web browser gets blocked on a number of sites because of a perfectly legitimate part of the user agent.

    Ah, yes. I use Pale Moon with the Random Agent Spoofer add-on (it's for FF, but also works just fine in PM). The spoofer randomizes my user agent whenever I restart the browser or click on the icon, to one of, I dunno, probably several hundred different OS/browser combinations, half of which I've never heard of (what the hell is Arora? Omniweb? Uzbl?).

    I'll occasionally get warnings about an unsupported or out-of-date browser, or strange layout problems (Google Maps and Image Search are terrible at this), but manually setting the user agent to some recent version of FF or Chrome invariably fixes the problem.

    What I'm trying to say is, if sites give you crap about your user agent, just fake it and pretend to be a recent version of Firefox or something.

    Bonus: also helps with tracking, if you don't like websites following your every move. With an "oddball web browser" you're easy prey; it's better to either periodically change it or settle on something a lot of people use.
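The "just fake it" fix above can be sketched in one line of stdlib Python; the Firefox UA string here is an illustrative 2016-era example, not taken from any particular browser build:

```python
# Sketch: overriding the User-Agent header on a single HTTP request
# using only the standard library (building the request does not
# touch the network).
from urllib.request import Request

firefox_ua = ("Mozilla/5.0 (X11; Linux x86_64; rv:47.0) "
              "Gecko/20100101 Firefox/47.0")

req = Request("https://example.com/", headers={"User-Agent": firefox_ua})

# urllib stores header keys capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```

curl and wget take the same override on the command line (`curl -A` and `wget --user-agent=`), which is why a whitelist keyed on user-agent strings proves so little.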
