
posted by n1 on Thursday May 21 2015, @08:28PM   Printer-friendly
from the hackers.txt dept.

Robots.txt files are simple text files that website owners place on their servers to tell web crawlers like Google and Yahoo not to index the contents of certain directories. It's a game of trust: webmasters don't actually trust the spiders to stay out of every file in those directories; they just expect the documents not to appear in search engines. By and large, the bargain has been kept.
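For illustration, a minimal robots.txt might look like this (the directory names are invented):

    User-agent: *
    Disallow: /admin/
    Disallow: /staging/
    Disallow: /private-reports/

Compliant crawlers skip those paths; nothing stops anyone else from visiting them.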

But hackers have made no such bargain, and the mere presence of a robots.txt file is like an X on a treasure map. Web site owners get careless, and, yes, some operate under the delusion that the promise of the spiders actually protects these documents.

The Register has an article explaining that hackers and rogue web crawlers actually use robots.txt files to find directories worth crawling.

Melbourne penetration tester Thiebauld Weksteen is warning system administrators that robots.txt files can give attackers valuable information about potential targets by pointing to the directories their owners are trying to protect.

Once a hacker gets into a system, it is standard reconnaissance practice to compile and update detailed lists of interesting subdirectories by harvesting robots.txt files. It requires less than 100 lines of code.
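To give a sense of how little code that takes, here is a toy harvester sketch in Python; the host list is a made-up placeholder, and this is only an illustration of the idea, not the tool the article refers to:

    # robots_harvest.py -- toy sketch: collect Disallow entries from robots.txt files.
    # Assumes Python 3 with the 'requests' package; the target list is hypothetical.
    import requests

    targets = ["https://example.com", "https://example.org"]

    for base in targets:
        try:
            resp = requests.get(base + "/robots.txt", timeout=5)
        except requests.RequestException:
            continue
        if resp.status_code != 200:
            continue
        for line in resp.text.splitlines():
            line = line.split("#", 1)[0].strip()   # drop comments and whitespace
            if line.lower().startswith("disallow:"):
                path = line.split(":", 1)[1].strip()
                if path and path != "/":
                    print(base + path)              # candidate directory worth a look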

If you watch your logs, you've probably seen web crawler tracks, and you've probably seen some just walk right past your robots.txt files. If you are smart, there really isn't anything of value "protected" by your robots.txt. But the article lists some examples of people who should know better leaving plenty of sensitive information hiding behind one.

 
  • (Score: 5, Interesting) by maxwell demon on Thursday May 21 2015, @09:45PM

    by maxwell demon (1608) on Thursday May 21 2015, @09:45PM (#186212) Journal

    What about making a robots.txt tar pit? Make a page listed nowhere but in robots.txt (and coming very early there), and have the web server return an endless stream of arbitrary uninteresting data (say, an ordered list of all positive integers), but with pauses after each chunk; just enough to not have the other side terminate the connection.

    A normal user would not find the page, and if he did find it, he would likely stop loading it soon after recognizing the pattern. A well-behaved bot will not visit the page, as it is listed as off-limits in robots.txt. A badly behaved bot will load the page but lack the intelligence to recognize its nature, and thus continue loading until an internal timeout for page loads triggers, the bot runs out of memory, or the bot gets killed for some other reason.
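    A rough sketch of such a tar pit as a tiny Python WSGI app; the /pit/ path, the chunk interval, and the port are arbitrary choices for illustration, not anything prescribed above:

        # tarpit.py -- toy sketch of the tar pit idea described above.
        # Serves an endless, slowly dripping list of integers at a path that you
        # would list in robots.txt and link from nowhere else.
        # Note: wsgiref's simple server is single-threaded, so a real deployment
        # would need something that handles concurrent connections.
        import time
        from wsgiref.simple_server import make_server

        def trickle():
            n = 1
            while True:
                yield ("%d\n" % n).encode("ascii")
                n += 1
                time.sleep(2)  # pause between chunks, short enough that clients don't give up

        def app(environ, start_response):
            if environ.get("PATH_INFO", "").startswith("/pit/"):
                start_response("200 OK", [("Content-Type", "text/plain")])
                return trickle()
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not here\n"]

        if __name__ == "__main__":
            make_server("", 8000, app).serve_forever()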

    --
    The Tao of math: The numbers you can count are not the real numbers.
  • (Score: 3, Interesting) by Adamsjas on Thursday May 21 2015, @10:19PM

    by Adamsjas (4507) on Thursday May 21 2015, @10:19PM (#186230)

    The story indicates that some security people actually do that, bogging down spiders that ignore robots.txt. It didn't provide any details about how, though.

    Sometimes the files/directories listed in robots.txt are used only by the web server (fetched and embedded in real time). Other times the URLs of those files appear in the transmitted HTML, relying on the remote browser to fetch them. (Very common for images, icons, and such, and even PDFs.)

    You would have to distinguish such a browser request from a crawler request, or limit yourself to server-side includes.
    You could also make the index.htm(l) in those directories the tar pit. Direct hits to a file would never be affected, but crawlers usually try to fetch the directory index at some point. A crude way to attempt the browser-vs-crawler distinction is sketched below.
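    One crude heuristic, sketched as a Python WSGI helper; the hostname is a placeholder, and since the Referer header is trivially spoofable this is a hint at best, not a reliable check:

        # referer_gate.py -- toy heuristic for the browser-vs-crawler distinction above.
        # A browser fetching an image or PDF embedded in one of your pages normally sends
        # a Referer from your own site; a crawler hitting a URL it found in robots.txt
        # usually does not. Crawlers can fake the header, so treat this as a hint only.
        def looks_like_embedded_fetch(environ, own_host="example.com"):  # own_host is a placeholder
            referer = environ.get("HTTP_REFERER", "")
            return own_host in referer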

    • (Score: 0) by Anonymous Coward on Thursday May 21 2015, @11:53PM

      by Anonymous Coward on Thursday May 21 2015, @11:53PM (#186257)

      > Sometimes the files/directories under robots.txt are used only by the web server (fetched and embedded in real time)

      Files such as these should not be publicly available on the website.

  • (Score: 2, Interesting) by Anonymous Coward on Thursday May 21 2015, @10:23PM

    by Anonymous Coward on Thursday May 21 2015, @10:23PM (#186232)

    Not going to work. Most bots parallelize tasks and have a fairly short timeout period per request. The ultra-tryhard attackers will even use a distributed botnet to stop you from IP-blocking them with trap pages.

    • (Score: 2, Interesting) by Anonymous Coward on Friday May 22 2015, @12:17AM

      by Anonymous Coward on Friday May 22 2015, @12:17AM (#186263)

      Some of the more intelligent ones will even have a single bot try such URLs and see if it gets banned; if so, the rest of the bots will ignore URLs like that.

    • (Score: 2) by maxwell demon on Friday May 22 2015, @06:53AM

      by maxwell demon (1608) on Friday May 22 2015, @06:53AM (#186344) Journal

      Most bots parallelize tasks and have a fairly short timeout period per request.

      But against those bots, you could protect your "private" resources simply by delaying their delivery for a short time, triggering the bot's timeout before your page is delivered. Sure, your users will see a short lag, but hey, who hasn't waited a few seconds for a web page to appear?
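      A minimal sketch of that kind of delay as Python WSGI middleware; the path prefix and delay value are arbitrary examples, not anything prescribed above:

          # delay.py -- toy sketch of the delay idea above, as WSGI middleware.
          # SLOW_PREFIXES is a stand-in for whatever paths your robots.txt disallows.
          import time

          SLOW_PREFIXES = ("/private/",)

          def slow_down(app, delay=10):
              def wrapper(environ, start_response):
                  if environ.get("PATH_INFO", "").startswith(SLOW_PREFIXES):
                      time.sleep(delay)  # longer than a typical bot's per-request timeout
                  return app(environ, start_response)
              return wrapper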

      --
      The Tao of math: The numbers you can count are not the real numbers.