
posted by n1 on Thursday May 21 2015, @08:28PM
from the hackers.txt dept.

Robots.txt files are simple text files that website owners place on their sites to ask web crawlers like Google's and Yahoo's not to index particular directories. It's a game of trust: webmasters don't actually trust the spiders to stay out of every file in those directories, they just expect the documents not to show up in search engines. By and large, the bargain has been kept.
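
To see how voluntary that arrangement is, here is a minimal sketch using Python's standard library: a polite crawler asks robots.txt for permission before fetching a URL, but nothing stops a client that simply skips the check. The hostname, user agent and path below are placeholders.

    # A polite crawler consults robots.txt voluntarily; the check is advisory only.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # placeholder site
    rp.read()

    url = "https://example.com/private/report.html"   # hypothetical "hidden" path
    if rp.can_fetch("MyCrawler/1.0", url):
        print("allowed to fetch", url)
    else:
        # The "protection" ends here: the server will still hand the file
        # to any client that ignores this answer.
        print("politely skipping", url)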

But hackers have made no such bargain, and the mere presence of a robots.txt file is like an X on a treasure map. Website owners get careless, and, yes, some operate under the delusion that the spiders' promise actually protects those documents.

The Register has an article explaining that hackers and rogue web crawlers actually use robots.txt files to find directories worth crawling.

Melbourne penetration tester Thiebauld Weksteen is warning system administrators that robots.txt files can give attackers valuable information on potential targets by giving them clues about directories their owners are trying to protect.

Once a hacker gets into a system, it is standard reconnaissance practice to compile and update detailed lists of interesting subdirectories by harvesting robots.txt files. It requires less than 100 lines of code.
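
The Register doesn't reproduce Weksteen's tooling, but a rough sketch of that kind of harvesting, in well under 100 lines of standard-library Python, might look like the following. The target hostnames and the output format are illustrative assumptions, not anything from the article.

    # Harvest the Disallow entries from a list of sites' robots.txt files.
    import urllib.request
    from urllib.parse import urljoin

    def disallowed_paths(base_url, timeout=10):
        """Return the paths a site's robots.txt asks crawlers to stay out of."""
        robots_url = urljoin(base_url, "/robots.txt")
        try:
            with urllib.request.urlopen(robots_url, timeout=timeout) as resp:
                text = resp.read().decode("utf-8", errors="replace")
        except OSError:
            return []
        paths = []
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()        # strip comments
            if line.lower().startswith("disallow:"):
                path = line.split(":", 1)[1].strip()
                if path:                                # a bare "Disallow:" means allow all
                    paths.append(urljoin(base_url, path))
        return paths

    if __name__ == "__main__":
        for target in ["https://example.com", "https://example.org"]:   # placeholders
            for url in disallowed_paths(target):
                print(url)   # every entry is a directory somebody thought worth hiding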

If you watch your logs, you've probably seen web crawler tracks, and you've probably seen some crawlers walk right past your robots.txt. If you are smart, there isn't anything of value "protected" by your robots.txt in the first place. But the article lists examples of people who should know better leaving plenty of sensitive information hiding behind one.
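
If you want to check your own logs for that, a sketch along these lines flags clients that fetched robots.txt and then requested a disallowed path anyway. It assumes an access log in common log format and a local copy of your robots.txt; the file names and the regex are assumptions to adjust for your own setup.

    # Flag clients that read robots.txt and then ignored it.
    import re

    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)')

    def disallowed_prefixes(robots_path="robots.txt"):
        prefixes = []
        with open(robots_path) as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if line.lower().startswith("disallow:"):
                    path = line.split(":", 1)[1].strip()
                    if path:
                        prefixes.append(path)
        return prefixes

    def rude_clients(log_path="access.log"):
        prefixes = disallowed_prefixes()
        saw_robots, offenders = set(), {}
        with open(log_path) as f:
            for line in f:
                m = LOG_LINE.match(line)
                if not m:
                    continue
                ip, path = m.groups()
                if path.startswith("/robots.txt"):
                    saw_robots.add(ip)
                elif ip in saw_robots and any(path.startswith(p) for p in prefixes):
                    offenders.setdefault(ip, []).append(path)
        return offenders

    if __name__ == "__main__":
        for ip, paths in rude_clients().items():
            print(ip, "read robots.txt and then requested:", ", ".join(sorted(set(paths))))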

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2) by bob_super (1357) on Thursday May 21 2015, @09:30PM (#186208)

    Excuse me if I'm missing something, but that still seems backwards... Shouldn't stuff "the admin wishes to hide" be hidden from all but those proving they have the proper security credentials? This way, it's hidden by default, not dependent on some other process to properly filter it out?

  • (Score: 0) by Anonymous Coward on Thursday May 21 2015, @11:02PM (#186243)

    > Shouldn't stuff "the admin wishes to hide" be hidden from all but those proving they have the proper security credentials?

    You can't both openly serve information to the public and simultaneously not serve it to the public.

    If you want to lock it down to a select group and only a select group, there are tons of ways to do that. Every webserver since 1995 has had some form of auth functionality (see the sketch below).

    But if you want to make it pseudo-public, so that humans see it without extra effort but automated systems do not, then you are looking at these in-between options.
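
    A minimal sketch of that auth route, using only Python 3's standard library; a real deployment would use the web server's own auth modules or an application framework, and the credentials here are obviously placeholders.

        # Serve files only to clients that present HTTP Basic Auth credentials.
        import base64
        from http.server import HTTPServer, SimpleHTTPRequestHandler

        USERNAME, PASSWORD = "admin", "change-me"   # illustrative only
        EXPECTED = "Basic " + base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()

        class AuthHandler(SimpleHTTPRequestHandler):
            def do_GET(self):
                if self.headers.get("Authorization") == EXPECTED:
                    super().do_GET()          # credentialed clients get the file
                else:
                    self.send_response(401)   # everyone else gets a challenge,
                    self.send_header("WWW-Authenticate", 'Basic realm="private"')
                    self.end_headers()        # not a hidden-but-public document

        if __name__ == "__main__":
            HTTPServer(("127.0.0.1", 8000), AuthHandler).serve_forever()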

    • (Score: 0) by Anonymous Coward on Friday May 22 2015, @02:33AM (#186295)

      To slow down automated crawling and caching of pages you don't want indexed by bots, you can require that someone solve a CAPTCHA or some other challenge that needs human intervention before granting access to the information.

      • (Score: 0) by Anonymous Coward on Friday May 22 2015, @04:37AM (#186323)

        Because everybody just lurves doing captchas.