
SoylentNews is people

posted by janrinok on Wednesday March 04 2015, @07:27PM
from the over-to-you dept.

What free software is there in the way of organizing lots of documents?

To be more precise, the ones I *need* to organize are the files on hard drives, though if I could include documents I have elsewhere (bookshelves and photocopy files) I wouldn't mind. They are text documents in a variety of file formats and languages, source code for current and obsolete systems, jpeg images, film clips, drawings, SVG files, object code, shared libraries, fragments of drafts of books, ragged software documentation, works in progress ...

Of course the files are already semi-organized in directories, but I haven't yet managed to find a suitable collection of directory names. Hierarchical classification isn't ideal -- there are files that fit in several categories, and there are a lot of files that have to be in a particular location because of the way they are used (executables in a bin directory, for example) or the way they are updated or maintained. Taxonomists would advise setting up a controlled vocabulary of tags and attaching tags to the various files. I'd end up with a triples store or some other database describing files.
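The taxonomists' suggestion -- a controlled vocabulary of tags backed by a triples store -- can be sketched in a few lines of Python with SQLite. The schema, tag names and file IDs here are invented purely for illustration:

```python
import sqlite3

# Minimal (subject, predicate, object) store; tags are one predicate.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT)")

def tag(file_id, *tags):
    db.executemany("INSERT INTO triple VALUES (?, 'tag', ?)",
                   [(file_id, t) for t in tags])

def files_tagged(t):
    rows = db.execute(
        "SELECT subject FROM triple WHERE predicate='tag' AND object=?", (t,))
    return sorted(r[0] for r in rows)

tag("doc-001", "lisp", "ibm-1620", "source-code")
tag("doc-002", "novel-notes", "draft")
print(files_tagged("lisp"))   # ['doc-001']
```

A real version would enforce the controlled vocabulary with a second table of allowed tags and a foreign-key constraint.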


But how to identify the files being tagged? A file-system pathname isn't enough. Files get moved, and sometimes entire directory trees full of files get moved from one place to another for various pragmatic reasons. And a hashcode isn't enough. Files get edited, upgraded, recompiled, reformatted, converted from JIS code to UTF-8, and so forth. Images get cropped and colour-corrected. And under these changes they should keep their assigned classification tags.
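One way around both problems is to mint a permanent identifier per file and key the tags to that, so pathnames and hashes can both change freely. A minimal sketch (in practice the registry would live in a database and be told about moves by whatever tool does the moving):

```python
import uuid

registry = {}   # file_id -> current path
tags = {}       # file_id -> set of tags

def register(path):
    # Mint a permanent identity once; it never changes afterwards.
    fid = str(uuid.uuid4())
    registry[fid] = path
    tags[fid] = set()
    return fid

def move(fid, new_path):
    registry[fid] = new_path   # identity and tags are untouched

fid = register("/home/hendrik/src/lu_decomp.c")
tags[fid].add("numerical")
move(fid, "/home/hendrik/archive/lu_decomp.c")
print("numerical" in tags[fid])   # True
```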

Now a number of file formats can accommodate metadata. And some software that manipulates files can preserve metadata and even allow user editing of it. But much software doesn't.

Much of it could perhaps be done by automatic content analysis. Other material may require labour-intensive manual classification. Now I don't expect to see any off-the-shelf solution for all of this, but does anyone have ideas as to how to accomplish even some of this? Even poorly? Does anyone know of relevant practical tools? Or have ideas towards tools that *should* exist but currently don't? I'm ready to experiment.
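As a hint of what "automatic content analysis" could mean at its very simplest, here is a naive frequency-based tag suggester. The stopword list and length threshold are invented for illustration:

```python
import re
from collections import Counter

STOP = {"the", "and", "for", "that", "with", "this"}

def candidate_tags(text, n=3):
    # Most frequent non-stopword terms of three letters or more.
    words = re.findall(r"[a-z]{3,}", text.lower())
    freq = Counter(w for w in words if w not in STOP)
    return [w for w, _ in freq.most_common(n)]

sample = "LU decomposition predicts disk usage; the decomposition is cached."
print(candidate_tags(sample))   # 'decomposition' ranks first
```

Real content analysis would at least weight terms by rarity across the corpus (tf-idf), but even this crude version could seed a tag vocabulary for later manual cleanup.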

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 1, Disagree) by Anonymous Coward on Wednesday March 04 2015, @07:55PM

    by Anonymous Coward on Wednesday March 04 2015, @07:55PM (#153211)

    git

    • (Score: 3, Interesting) by VLM on Wednesday March 04 2015, @08:33PM

      by VLM (445) Subscriber Badge on Wednesday March 04 2015, @08:33PM (#153225)

      AC has it correct, stash your stuff in git to track changes.

      Something OP didn't think about was the inevitable death of the platform and your plan for pulling the goodies out and reusing them somehow. So proprietary SW would be insane. There's some pretty crazy dead stuff out there that was cool, once.

      The free version of gitlab (basically similar to github) with the git web extension would do it.

      Something to think about before paying for something very advanced is that you might end up with what amounts to a git-backed wiki. That's all. No need for bug trackers and merge routines and hooks and deployment. Just a wiki on top of git.

      • (Score: 2) by buswolley on Wednesday March 04 2015, @08:39PM

        by buswolley (848) on Wednesday March 04 2015, @08:39PM (#153228)

        what about binary documents?

        --
        subicular junctures
        • (Score: 0) by Anonymous Coward on Wednesday March 04 2015, @09:14PM

          by Anonymous Coward on Wednesday March 04 2015, @09:14PM (#153245)

          Base64 and git

        • (Score: 2) by VLM on Wednesday March 04 2015, @09:18PM

          by VLM (445) Subscriber Badge on Wednesday March 04 2015, @09:18PM (#153247)

          What about em? disk space is cheap, labor is expensive, keep the old versions.

          • (Score: 2) by hendrikboom on Wednesday March 04 2015, @11:07PM

            by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:07PM (#153280) Homepage Journal

            OK. What's em?

            Yes, disk space is cheap. But it's not just a question of keeping old versions somewhere, it's a question of being able to *find* them when you've forgotten the file name.

            • (Score: 2) by buswolley on Thursday March 05 2015, @01:07AM

              by buswolley (848) on Thursday March 05 2015, @01:07AM (#153327)

              Also, I thought git (I don't use it) was used for versioning file content line by line (e.g. line 67 changed), which can't be done the same way if the file is binary.

              --
              subicular junctures
              • (Score: 3, Informative) by khedoros on Thursday March 05 2015, @04:48AM

                by khedoros (2921) on Thursday March 05 2015, @04:48AM (#153391)
                Git handles binary files. Apparently you can even define diff methods for file types and for specific files. The example here [git-scm.com] uses a conversion utility to extract text from a docx file to do the diff. Even if that weren't possible, you could manually put a description of the change in when checking in the updated file.
      • (Score: 3, Interesting) by hendrikboom on Wednesday March 04 2015, @11:13PM

        by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:13PM (#153284) Homepage Journal

        At the moment I use monotone for revision management. It too is good at keeping track of changes. But it's not changes I'm asking about.

        It's more like, "Didn't I write a function a few years ago that used LU-decomposition to predict disk usage? I think I might adapt it to my new image-editing project. Now what was it called... For that matter, what project was it part of?"

        -- hendrik

      • (Score: 4, Interesting) by hendrikboom on Wednesday March 04 2015, @11:28PM

        by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:28PM (#153291) Homepage Journal

        You are absolutely right about avoiding proprietary software. I still have some files dating back to the 60's. Any indexing system I use should be as long-lived as the files themselves.

        For example. I wrote a Lisp interpreter for the IBM 1620 in assembler. I still have it. I found it -- somewhere -- when I was looking for something else a few years ago. It has one known bug. I wasn't as good a programmer then as I am now. Someday I'd like to find and fix it. Not that it would be useful. It just bothers me. I sought the bug for a week or so back then before I had to get back to studying algebraic topology.

        But where's the code now?

        • (Score: 2) by frojack on Thursday March 05 2015, @04:33AM

          by frojack (1554) Subscriber Badge on Thursday March 05 2015, @04:33AM (#153382) Journal

          You are absolutely right about avoiding proprietary software. I still have some files dating back to the 60's. Any indexing system I use should be as long-lived as the files themselves.

          Why?
          You still have the files. Re-index them when better software comes along.

          --
          No, you are mistaken. I've always had this sig.
          • (Score: 2) by hendrikboom on Thursday March 05 2015, @01:07PM

            by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @01:07PM (#153482) Homepage Journal

            If the indexing is fully automatic, yes, reindexing is a no-brainer.

            But if it's even partially manual, I'd like to be able to go on using the old index.

      • (Score: 1, Informative) by Anonymous Coward on Thursday March 05 2015, @04:36AM

        by Anonymous Coward on Thursday March 05 2015, @04:36AM (#153384)

        I'm not sure it's what you're looking for, but if you haven't heard about TiddlyWiki, you should probably take a look. I've been using it for a couple of years, and it's insanely flexible, very easy to use, and very good for incrementally organizing large heterogeneous data. I'm currently beginning to use it in a similar capacity (albeit probably at a smaller scale).

        It supports tagging, and has a lot of different deployment options. If I were working your problem in an incremental fashion, I might drop a mini-wiki into each folder, and use that wiki to organize that folder, and then go back and start linking all the mini-folder wikis together into a master wiki (I believe TiddlyWiki can include sub-wikis by reference).

        At the point I'm at, I'm starting to get into the widget language for building in more customization, but I haven't yet dug into the underlying JavaScript, plugin language or node.js.

        It has changed my life. I'm much more organized now. I've organized and *mostly* codified my whole approach to life.

        I have a few reservations about it, but every time I think about them, they pale in comparison to what I've been able to accomplish with it.

        Reservations:

        1) While designed to scale, it does have limits. I guess I would characterize this as dimensionality of scaling. If you only want to scale in one dimension, say recording a few attributes each for lots of files, it might work really well. If you want to scale in multiple dimensions, recording arbitrarily detailed attributes for lots of files, you will probably run into more complexity. Since most of my scaling issues are fairly one-dimensional, I can usually find a pretty good approach.

        2) Uneven. While so many things are dead easy in TW, a few are surprisingly hard. The programmatic interface assumes a data model, and you have to become familiar with it to be effective. So, once you're a little ways off the beaten path, things start to get complex in a hurry. This might be mitigated a lot if you're already familiar with JavaScript or node.js. I'm not. The good news is that the beaten path is pretty well beaten down, and a lot of thought has gone into it.

        It's still a work in progress, but I don't really expect the unevenness to get much better in the next couple of years. While I continue to see a lot of improvements, I have reason to believe that there are still some fundamental limitations in what the wiki syntax can do and in the programming model. I hope that these will be incrementally burned off, as the architecture is robust enough to allow for that, but I'm not sure if that's where the development focus is right now.

        http://tiddlywiki.com/ [tiddlywiki.com]

        • (Score: 2) by hendrikboom on Thursday March 05 2015, @05:29PM

          by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @05:29PM (#153572) Homepage Journal

          I'm aware of tiddlywiki. As I understand it, the entire text corpus resides in the same file as the tiddlywiki code. Which raises the question: How to update the code when there's a new version? I suppose one legitimate answer is that you don't.

          Tiddlywiki may well be useful for things like collecting ideas and notes in the early stages of project planning. And those notes will be useful if I come back to the project years later for any reason.

          What I've started doing is writing a lot of text into my programming projects. This might very well fit into tiddlywikis. There's text about the project design. There's a speculative diary, in which I record my thoughts about where the project might be going -- what approaches there are to various problems, what tentative stopgaps might be used until I get to a real solution, and so forth. Some of those are plain text, some are hand-coded simple html, and some are html generated from asciidoc.

          They are linked through various directories. (Moving a directory is always awkward because of the html links between them.) At the higher levels, these are more project lists than implementation details.

          Attaching these to a search engine with content analysis might be an effective way of indexing at least this part of the file base. Is there some way of embedding hand-made content tags into the html or asciidoc so they stay together with the files? There probably is.
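There is indeed: HTML has carried author-supplied tags since the early days via `<meta name="keywords">`, and an indexer can harvest them with nothing but the Python standard library. A sketch (document and tag names are made up):

```python
from html.parser import HTMLParser

doc = """<html><head><title>Disk planner</title>
<meta name="keywords" content="lu-decomposition, disk-usage, planning">
</head><body>...</body></html>"""

class KeywordHarvester(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        a = dict(attrs)
        if tag == "meta" and a.get("name") == "keywords":
            self.keywords += [k.strip() for k in a.get("content", "").split(",")]

h = KeywordHarvester()
h.feed(doc)
print(h.keywords)   # ['lu-decomposition', 'disk-usage', 'planning']
```

Asciidoc has an analogous `:keywords:` attribute in the document header, so the same harvesting approach should carry over with a one-line regex.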

          Anyone know any handy libre content analysis software?

          -- hendrik

          • (Score: 0) by Anonymous Coward on Saturday March 07 2015, @09:22PM

            by Anonymous Coward on Saturday March 07 2015, @09:22PM (#154230)

            In TiddlyWiki, when there's a new version released, you can grab a copy of the new version file, and import your old project.

    • (Score: 3, Insightful) by FatPhil on Wednesday March 04 2015, @11:21PM

      by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Wednesday March 04 2015, @11:21PM (#153288) Homepage
      At no point do I see the concept of version control being relevant at all.

      I recommend git for pretty much everything that requires version control, but it's a hammer to this screw.

      To the OP - learn how to organise a directory hierarchy, and symbolic links, so that you can access "a and b" things via both "a/b/" and "b/a/", and if that's not good enough, then learn how to not store so much shit that you can't access it simply through knowledge of the two most important keywords.
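The symlink trick described here, in miniature (throwaway paths under a temp directory; "music" and "lisp" are arbitrary example categories):

```python
import os
import tempfile

root = tempfile.mkdtemp()
real = os.path.join(root, "music", "lisp")    # the real "a/b" directory
os.makedirs(real)
os.makedirs(os.path.join(root, "lisp"))
link = os.path.join(root, "lisp", "music")    # "b/a" points back at "a/b"
os.symlink(real, link)

open(os.path.join(real, "player.lisp"), "w").close()
print(os.listdir(link))   # ['player.lisp'], the same directory via the second path
```

The cost, as the thread notes elsewhere, is that every pairing needs its own link, which is exactly the kind of bookkeeping an indexer would otherwise absorb.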
      --
      Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
      • (Score: 3, Interesting) by sigma on Thursday March 05 2015, @01:08AM

        by sigma (1225) on Thursday March 05 2015, @01:08AM (#153328)

        learn how to not store so much shit that you can't access it simply through knowledge of the two most important keywords.

        Sometimes that's not an option, and yep, git is wrong for this purpose. As I see it, OP has two paths to choose from: A system that allows for complex and versatile file/document management, or a system that minimises the need to manage the files.

        If managing the files is the goal, then a Content Management System (CMS) is the answer, and one open source CMS that should do the job is Alfresco - http://www.alfresco.com/. [alfresco.com] There may be others, but I know Alfresco will do everything OP asked for and more.

        The alternative path is to forget about directly managing the files, and to use an indexing search tool to make sense of them. My approach is to leave my files in relative chaos and let Apache's Solr/Lucene - http://lucene.apache.org/solr/ [apache.org] solve that problem. I'm pretty sure it would achieve OP's goal, though in a different (and possibly challenging for those with OCD tendencies) way.

        • (Score: 2) by FatPhil on Thursday March 05 2015, @08:04AM

          by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Thursday March 05 2015, @08:04AM (#153443) Homepage
          Haha - my day job currently is to write an alternative to solr/lucene! Alas, I don't think it will ever be seen outside the context of email.
          --
          Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
  • (Score: 4, Insightful) by frojack on Wednesday March 04 2015, @07:58PM

    by frojack (1554) Subscriber Badge on Wednesday March 04 2015, @07:58PM (#153213) Journal

    Ah, the curse of terabyte drives.

    First some fill in the blanks questions:

    Who (how many) are going to be using this? Just you? A small office of people?
    Where are all the files? Internal drive, NAS? Across a local area network?
    And, (drum roll) What OS does it have to reside on (not the documents, but the software)?

    All of these matter a great deal.

    --
    No, you are mistaken. I've always had this sig.
    • (Score: 2) by hendrikboom on Wednesday March 04 2015, @08:20PM

      by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @08:20PM (#153219) Homepage Journal

      Who (how many) are going to be using this? Just you? A small office of people?

      Just me. But I do hope that if anything useful comes of it, it'll be useful to others as well.

      Where are all the files? Internal drive, NAS? Across a local area network?

      Most are on a server that's currently running Debian Wheezy. It provides both NFS and sshfs. Some are on a laptop, which usually runs Linux. A very few on a Windows partition on that laptop. I have control over the server and the Linux part of the laptop. And there are multiple (but not perfectly up-to-date) copies on backup drives, which are normally *not* attached to the machine.

      And, (drum roll) What OS does it have to reside on (not the documents, but the software)?

      Something that can access the server, if not the server itself.

      • (Score: 2) by frojack on Wednesday March 04 2015, @11:26PM

        by frojack (1554) Subscriber Badge on Wednesday March 04 2015, @11:26PM (#153290) Journal

        Someone recommended recoll to me. http://software.opensuse.org/package/recoll [opensuse.org]

        Everything I know about it at this point I learned from that link. It mentions no database, which to me suggests it's going to crawl your disk for everything you search, which if true is not acceptable.

        --
        No, you are mistaken. I've always had this sig.
        • (Score: 2) by hendrikboom on Wednesday March 04 2015, @11:42PM

          by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:42PM (#153293) Homepage Journal

          recoll appears to be a front end to xapian, which is an interesting tool in itself. The recoll page you linked to says it doesn't use a database, but xapian does. And the xapian page says it can handle databases larger than 2G, which is essential for large document collections. So I do suppose it uses a database and doesn't scan the entire file system for every query.

      • (Score: 2) by sigma on Thursday March 05 2015, @01:15AM

        by sigma (1225) on Thursday March 05 2015, @01:15AM (#153331)

        I've posted suggestions above, but after reading this, I think you should look at Solr on the server and maybe Chrome POSTMAN for the clients.

  • (Score: 2) by buswolley on Wednesday March 04 2015, @08:17PM

    by buswolley (848) on Wednesday March 04 2015, @08:17PM (#153216)

    Just did a search, so I don't know much, but how about this:
    http://www.tagspaces.org [tagspaces.org]

    --
    subicular junctures
    • (Score: 2) by bart9h on Wednesday March 04 2015, @08:28PM

      by bart9h (767) on Wednesday March 04 2015, @08:28PM (#153222)

      where does it store the tags?

      what happen when I rename / move files around?

      • (Score: 2) by buswolley on Wednesday March 04 2015, @08:35PM

        by buswolley (848) on Wednesday March 04 2015, @08:35PM (#153226)

        Apparently TagSpaces doesn't use a database. It's just a system enabling easy construction of filenames that are parsed by the program as tags. Interesting.
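A toy parser for that kind of filename-borne tagging might look like this (the square-bracket convention mimics what TagSpaces is described as doing, but the exact format here is an assumption):

```python
import re

def parse_tags(filename):
    # Tags live in one bracketed, space-separated group in the name itself.
    m = re.search(r"\[([^\]]*)\]", filename)
    return m.group(1).split() if m else []

print(parse_tags("holiday photo [beach 2014 family].jpg"))
# ['beach', '2014', 'family']
print(parse_tags("plain.txt"))   # []
```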

        --
        subicular junctures
    • (Score: 2) by hendrikboom on Wednesday March 04 2015, @08:57PM

      by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @08:57PM (#153237) Homepage Journal

      Looks interesting, but seems to require manual tagging. It's open-source. It puts the tags into the filenames. This isn't likely to work well in the names of, say C include files. And it won't work for pieces of information smaller than a file. I wonder whether it uses a controlled vocabulary for the tags.

      I'll have to look at this further, as a source of ideas, and maybe as a directly usable system for part of the file base.
       

      • (Score: 0) by Anonymous Coward on Wednesday March 04 2015, @09:23PM

        by Anonymous Coward on Wednesday March 04 2015, @09:23PM (#153249)

        Yeah. For documents it would be great. For the output of scripts, or for source files, not so sure.
        smaller than a file?

        • (Score: 2) by hendrikboom on Wednesday March 04 2015, @10:58PM

          by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @10:58PM (#153276) Homepage Journal

          Smaller than a file? Like an email with six attachments. Like a file of miscellaneous plot ideas for writing novels. Like compiled functions in a library.

          • (Score: 0) by Anonymous Coward on Thursday March 05 2015, @01:05AM

            by Anonymous Coward on Thursday March 05 2015, @01:05AM (#153326)

            ah content indexing, I get it.

  • (Score: 2) by MrGuy on Wednesday March 04 2015, @08:38PM

    by MrGuy (1007) on Wednesday March 04 2015, @08:38PM (#153227)

    What's the end goal of "organizing" the documents? Are you trying to put them in a "proper" tree structure? If so, for what usage situation?

    Why is simply a run-time search (e.g. Apple Spotlight) insufficient for your needs? I'm not saying there aren't cases where there IS a good reason, but it's not clear your case is one of them. If you're dead set on trawling through directories and tagging files, power to you, but I'd start with WHY before we start a discussion on HOW.

    • (Score: 4, Informative) by frojack on Wednesday March 04 2015, @09:07PM

      by frojack (1554) Subscriber Badge on Wednesday March 04 2015, @09:07PM (#153242) Journal

      Agreed, for practical day-to-day use, organized tree-structured directories require a lot of knowledge about the structure to be useful. Even then, finding anything in a big directory is a problem. But at least with Linux he can link the same document into multiple directories without duplicating the files. That too takes a lot of knowledge.

      Better is a good indexing system, the topic of this post, and a request for tools.

      With a good tool, you should be able to search not only for file names, but also file types and, most importantly, content, plus any tags or metadata you may have added. Most of all, it can't be relying on search-while-you-wait technology (find). It has to have an indexed database and be self-maintaining.
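The indexed-database requirement boils down to: crawl once, answer queries from the index. A toy inverted index over both names and content (a real tool would persist this and watch for changes, e.g. via inotify, rather than rebuilding):

```python
import os
import re
import tempfile
from collections import defaultdict

def build_index(root):
    # Map every word in a file's name or content to the set of paths
    # containing it; queries then never touch the disk.
    index = defaultdict(set)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            words = set(re.findall(r"\w+", name.lower()))
            try:
                with open(path, errors="ignore") as f:
                    words |= set(re.findall(r"\w+", f.read().lower()))
            except OSError:
                pass
            for w in words:
                index[w].add(path)
    return index

# Demo over a throwaway tree.
root = tempfile.mkdtemp()
with open(os.path.join(root, "predict.c"), "w") as f:
    f.write("int lu_decompose(void);")
idx = build_index(root)
print("lu_decompose" in idx)   # True: a function-name query hits instantly
```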

      With such a tool, a good one, you could resort to just using a document heap. (Not recommending that, but you maybe could).

      I too am interested in such a tool, but it's got to be self-managing, widely accessible, fast, and multiplatform too.

      ----------------
      I've found Baloo pretty good for this. (KDE4, OpenSuse). It is their new desktop indexing system to replace the prior iterations of Nepomuk which were horrible processor hogs. https://community.kde.org/Baloo [kde.org]

      It will index file name and CONTENT. So I just point it at every directory I want to be able to find stuff in, including my source code directories. Want to know every use of a particular function call? Type its name in the Dolphin search, and (because it's all fully indexed) your list of files is populated instantly.

      I use it mostly for content search. But I haven't tried it to see if it will index nfs mounts, and if so, at what network bandwidth price, and what happens when you dismount the nfs volume.

      --
      No, you are mistaken. I've always had this sig.
      • (Score: 2) by Immerman on Thursday March 05 2015, @02:24AM

        by Immerman (3985) on Thursday March 05 2015, @02:24AM (#153344)

        Agreed, a good search tool is a wonderful option. My own preference is the freeware tool Everything - on launch it nigh-instantly lists every file on the selected drives, and as fast as you can type it winnows the list to only those items whose name/path contains the specified word-fragments (alternately it can employ regular expressions). Because it exploits the NTFS journal, index updates are almost instantaneous: it takes under a minute to build an index of hundreds of thousands of files from scratch, even on a slow HDD, and updates typically take only a few seconds, with most changes auto-detected immediately.

        Downsides are:
        - it can only search file names, but get into the habit of appending meta-tags to the end of the name and you can rest assured that you have a relatively future-proof and platform-agnostic "database": pretty much any search tool will work beautifully with file names
        - the NTFS journal exploitation only works on local NTFS drives, and seemingly only under Windows. For other file systems, or when running under WINE, you need to configure periodic index updates, or initiate them manually. (You also need admin rights to exploit the journal, though it can be configured to run as a service accessible to any user.)

        It takes some tweaking to work decently under WINE [configure open files/folders on its context menu to run $exec(winebrowser "%1")], and still leaves much to be desired, but nothing else I've found is remotely in the same league. I'd love to find a comparable native alternative, but thus far I've come up empty-handed.

        • (Score: 2) by frojack on Thursday March 05 2015, @04:50AM

          by frojack (1554) Subscriber Badge on Thursday March 05 2015, @04:50AM (#153393) Journal

          Probably not what the OP needs, since he is on linux.

          Often you don't have the option of appending crap to the name. For instance, I have to index tons of source code. You don't get to change those names.
          Legal documents are another thing you really can't mess with.

          And a file-name-only search means you pretty much have to know the file name. Which is not likely going to be the case once you get beyond a few hundred thousand files.

          --
          No, you are mistaken. I've always had this sig.
          • (Score: 2) by Immerman on Thursday March 05 2015, @05:13AM

            by Immerman (3985) on Thursday March 05 2015, @05:13AM (#153406)

            I put the tweaks needed for WINE there for a reason - I mostly run Linux, and have pretty much given up on finding a comparable native tool, and it's far too useful to abandon.

            As for file name limitations - yes, source code is a challenge, though I can't say I've seen any content-indexing systems that are worth much for source either, but fortunately it does tend to lend itself to hierarchical organization along module lines. For most everything else, I've begun abandoning short, condensed names in favor of actual descriptive ones that don't need to be memorized: "Trigonometry and Geometry quick-reference sheet v14.9.svg" will almost certainly be in the top ten results when I type "geo ref tri", and if not I'll just keep typing more requirements.
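The matching style described here -- every typed fragment must occur somewhere in the name, in any order -- is easy to mimic (how Everything matches internally is an assumption; this just reproduces the described behaviour):

```python
def matches(name, query):
    # A name survives if every space-separated fragment of the query
    # appears somewhere in it, case-insensitively.
    n = name.lower()
    return all(frag in n for frag in query.lower().split())

name = "Trigonometry and Geometry quick-reference sheet v14.9.svg"
print(matches(name, "geo ref tri"))    # True
print(matches(name, "geo ref calc"))   # False
```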

            One of the absolutely essential components of Everything is that I don't have to initiate a search - it shows me the matching results as fast as I type, starting with a list of all 200,000+ files on my computer when first launched. Even the fastest results after some "initiate search" trigger (hitting [Enter] or whatever) offer a qualitatively inferior experience.

          • (Score: 2) by hendrikboom on Thursday March 05 2015, @07:25PM

            by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @07:25PM (#153606) Homepage Journal

            Even though Everything is Windows-only, it's still worth discussing, because its salient features could perhaps be implemented in a Linux program. The OP *is* interested in tools that should exist but don't, after all.

            -- the OP

            • (Score: 2) by frojack on Thursday March 05 2015, @10:48PM

              by frojack (1554) Subscriber Badge on Thursday March 05 2015, @10:48PM (#153671) Journal

              The OP *is* interested in tools that should exist but don't, after all.

              The tools do exist. The OP just doesn't know what/where they are.

              --
              No, you are mistaken. I've always had this sig.
        • (Score: 2) by hendrikboom on Thursday March 05 2015, @06:52PM

          by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @06:52PM (#153595) Homepage Journal

          Everything looks useful.

          It might not be impossible to port or rewrite for Linux.

          Except for the keystroke interaction, it seems to do the same as locate and updatedb. I wonder if it would be easy to write code that does the keystroke interaction to search locate's database.
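That code might be surprisingly small. A sketch of the keystroke loop, winnowing survivors in memory (the `locate .` invocation for dumping the database is an assumption about mlocate/plocate behaviour and isn't run here; the demo uses a stub list):

```python
import subprocess

def all_indexed_paths():
    # Hypothetical one-time dump of locate's database to a list;
    # on many systems `locate .` prints every indexed path.
    out = subprocess.run(["locate", "."], capture_output=True, text=True)
    return out.stdout.splitlines()

# Simulated session over a stub list instead of a real locate database.
current = ["/usr/bin/locate", "/home/h/notes.txt", "/home/h/src/locator.c"]
query = ""
for ch in "loc":                                    # simulated keystrokes
    query += ch
    current = [p for p in current if query in p]    # search survivors only
print(current)   # ['/usr/bin/locate', '/home/h/src/locator.c']
```

Because each keystroke only extends the query, filtering the previous survivors gives the same result as re-searching everything, which is what makes the per-keystroke cost shrink as you type.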

          • (Score: 2) by Immerman on Thursday March 05 2015, @08:12PM

            by Immerman (3985) on Thursday March 05 2015, @08:12PM (#153632)

            I suspect it uses progressive winnowing - i.e. only searching the already reduced list as each new letter is added, which seems to run counter to Locate's interface at least - though I'll admit I never figured out how to get Locate to work as even a decent traditional search tool.

            • (Score: 2) by hendrikboom on Thursday March 12 2015, @07:31PM

              by hendrikboom (1125) Subscriber Badge on Thursday March 12 2015, @07:31PM (#156849) Homepage Journal

              Another possibility is to use a mix of lazy and eager evaluation. When the first character is typed, all you need to find is the small number of entries that will actually fit on the screen. You can go on generating the rest of the list while waiting for the next character. When the second character is typed, you go through the list of remaining entries (which may only be partially generated) until you have a new screenful of surviving entries. Then while waiting for the third character, you go on generating this new list, switching gears when you get to the end of the second list and proceeding with the first, and when it expires, going on with the original index.

              And so on.
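A simplified cut of that scheme in Python generators: one lazy filter is stacked per keystroke, and only a screenful is ever forced out at the end (the caching of partially generated lists between keystrokes described above is omitted from this sketch):

```python
from itertools import islice

def filt(stream, q):
    # One lazy layer per keystroke; nothing is computed until consumed.
    return (e for e in stream if q in e)

def stacked(entries, keystrokes):
    stream, query = iter(entries), ""
    for ch in keystrokes:
        query += ch
        stream = filt(stream, query)
    return stream

entries = ["alpha.txt", "lab-notes.md", "lisp1620.asm", "plans.html"]
# Force only a screenful (here, two entries) of the final survivors.
print(list(islice(stacked(entries, "la"), 2)))
# ['lab-notes.md', 'plans.html']
```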

              • (Score: 2) by Immerman on Friday March 13 2015, @01:11AM

                by Immerman (3985) on Friday March 13 2015, @01:11AM (#157075)

                An excellent idea, but not, I think, one that it's using: the status bar immediately displays the number of matches, which wouldn't be possible with lazy evaluation. Though I would assume that the actual UI displayed list is generated on the fly - with as much data as is displayed for each file, a comprehensive list would require far more RAM than is actually being used.

                Still, once I thought about it, if your average filename is, say, 20 characters, and a 1GHz processor can compare 1 billion characters per second, then that's 50 million files that could be scanned in a second: my paltry 150,000 files would take less than 1/300th of that. So a simple exhaustive search could potentially do the job quite tidily, especially if you assume path names are searched separately and then cross-referenced with the files they contain.
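The arithmetic holds up even in interpreted Python: an exhaustive substring scan over 150,000 random twenty-character names finishes well under a second (the query string and name lengths are arbitrary):

```python
import random
import string
import time

names = ["".join(random.choices(string.ascii_lowercase, k=20))
         for _ in range(150_000)]

t0 = time.perf_counter()
hits = [n for n in names if "qx" in n]   # brute-force scan, no index at all
dt = time.perf_counter() - t0
print(f"scanned {len(names)} names in {dt * 1000:.1f} ms")
```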

    • (Score: 2) by hendrikboom on Wednesday March 04 2015, @11:45PM

      by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:45PM (#153296) Homepage Journal

      Why organise files? Because I want to be able to find them. I'm having trouble doing this now with locate and grep.

      Directories and symbolic links help. They're not enough, because they don't reveal anything about the content of the files.

    • (Score: 2) by hendrikboom on Thursday March 05 2015, @02:43AM

      by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @02:43AM (#153350) Homepage Journal

      I'd much prefer *not* to trawl through the whole file system attaching tags. I'd much rather have some kind of automated content analysis. But for some things it may be unavoidable. And new hand-written content may as well be born properly tagged.

      It's not as if I really believe that one single mechanism will suffice for everything.

      -- hendrik

  • (Score: 5, Funny) by buswolley on Wednesday March 04 2015, @08:42PM

    by buswolley (848) on Wednesday March 04 2015, @08:42PM (#153229)

    only organization system you'll ever need

    Desktop/Old_Desktop/Old_Desktop/Old_Desktop/Old_Desktop/Old_Desktop/Old_Desktop/Old_Desktop/Old_Desktop/Old_Desktop/Old_Desktop...

    --
    subicular junctures
    • (Score: 2, Insightful) by Anonymous Coward on Wednesday March 04 2015, @08:44PM

      by Anonymous Coward on Wednesday March 04 2015, @08:44PM (#153230)

      Correction.
      The only organization system you'll ever USE.

  • (Score: 1) by No Respect on Wednesday March 04 2015, @08:48PM

    by No Respect (991) on Wednesday March 04 2015, @08:48PM (#153232)

    Sometimes it seems like a 2-D file hierarchy isn't enough. For instance, I get a file from my company president. It's about a project schedule. And it's important. And it has a date on it. What's a good way to save and/or catalog such a file so that it can be found when searching for it using any number of orthogonal attributes? There are probably solutions out there - document control is a big thing from what I hear - that use a backend database but that seems like overkill. Maybe not.

    My email needs are similar. Right now I'm setting up Thunderbird (it has to work with Windows) to interface with a gmail account using IMAP. It's not optimal because gmail has embraced and extended the IMAP standard. Slapping labels on everything is not the same as putting things in containers, but when faced with the need for a 5-dimensional array of "containers", it seems to work well enough. For now at least. I'm still not happy with it. A lot of emails come with attachments that can be detached from the email bodies, and then there's the same problem of how to organize those documents that the OP describes. Organization by simple filesystem hierarchy is inadequate for many needs.

    • (Score: 2) by Immerman on Thursday March 05 2015, @02:38AM

      by Immerman (3985) on Thursday March 05 2015, @02:38AM (#153348)

      Meta-tags are a decent solution. I append tags to the end of my file names, making them easy to locate with any filename search tool (of which Everything is far and away the best I've found - with nigh instant find-as-you-type). The folder hierarchy then becomes a secondary organization scheme, one that after 30 years I'm *still* working on finding good guidelines for, though it promises to be more useful as a secondary scheme than the primary one.

      • (Score: 2) by hendrikboom on Thursday March 05 2015, @02:58AM

        by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @02:58AM (#153355) Homepage Journal

        I'll have to investigate Everything.

        And the good guidelines you seek -- have you at least found some not-so-good guidelines?

        • (Score: 2) by Immerman on Thursday March 05 2015, @04:50AM

          by Immerman (3985) on Thursday March 05 2015, @04:50AM (#153395)

          You mean beyond "old_desktop/old_desktop/..."? I wish. Some top-level folders that, in a personal setting, seem to have some staying power (typically stored on a separate partition accessible from whatever OSes I'm multibooting, without permission headaches, etc.):
          - An "Attic" or "Archive" folder holding photos, backups, etc. that I'll rarely modify, delete, or access, but want to keep at hand.
          - A "Library" folder for music, videos, ebooks, etc - stuff that I didn't create, am not going to modify, but may access frequently. This one is typically fairly easy to organize along physical library lines.
          - A "Data" or "Documents" folder, for stuff I create, linked from my various home folders (which are otherwise empty except for configuration files, which I don't really want mixed in with *my* data) - basically all the stuff I should really be backing up on a regular basis. Probably the most disorganized of the lot.
          - A "References" folder, containing quick reference sheets, etc. (often ends up someplace in "Data" with a top-level link)
          - An "Active project" folder, containing only links/shortcuts to the projects I'm currently working on, wherever they may be in my hierarchy

          That much has even managed to survive my shift to a search-based organization, where it serves as a way to further winnow my search results. As for the various subfolders, guiding principles, etc? Nothing that's stood the test of time. I'll give major projects their own folder as a "file grouping" convenience, but have yet to come up with any consistent guiding principles beyond that.

          One thing I notice as I get accustomed to searching though is that my hierarchy is beginning to flatten - there's no longer a significant "finding penalty" to having 1,000 files in the same folder, and a deeper folder hierarchy means less potential filename length to hold tags before running into path-length restrictions, as well as the usual inconsistent chaos that tends to plague it.

          I consider "find as you type" to be *absolutely* essential though, even searches that give "instant" results as soon as you hit [Enter] can't compare. It lets me see as I type how well my winnowing is performing thus far. Do I need to add a few more letters to the current word fragment to clarify? Add some more fragments? Or have I already gotten down to just a handful of files from which I can spot what I want at a glance?

          If you're on a 'nix though, Everything comes with caveats - it'll run fine in WINE, but you have to configure the paths you want it to index, and periodically update the index, since it can't exploit the NTFS journal for continuous monitoring (on Windows it defaults to indexing all local NTFS drives). And you'll want to configure the context menu so that "open file" and "open folder" both execute
                    $exec(winebrowser "%1")
          and "open path" executes
                  $exec(winebrowser "$pathpart(%1)")
          That'll get you the core functionality at least. I'm not happy with the WINE-induced path-name inconsistencies, but haven't found anything native to compare.

          • (Score: 2) by Immerman on Thursday March 05 2015, @04:53AM

            by Immerman (3985) on Thursday March 05 2015, @04:53AM (#153397)

            Oh, also - I'm still on the fence about tags, but I'm getting better about using long descriptive filenames, which work almost as well since Everything doesn't actually distinguish between tags and the rest of the filename.

            • (Score: 2) by hendrikboom on Thursday March 05 2015, @07:04PM

              by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @07:04PM (#153600) Homepage Journal

              The really distinctive feature of Everything would appear to be the interaction mechanism. That could be rewritten. I mentioned using it with the locate database. Ideally, this interactive search could be applied to indexes other than a list of files... then other code could prepare indexes of, say, the metadata in photos or the tags in HTML documents, and they could all be searched at once.

              I wonder how the index is organised. Perhaps differently from the locate database, which I suspect is just a compressed list of all files, to be scanned with something like grep. Does it slow down when you search for two words? Or is the set of found things small enough after the first word that a complete in-memory scan of the residue so far becomes feasible?

              • (Score: 2) by Immerman on Thursday March 05 2015, @08:27PM

                by Immerman (3985) on Thursday March 05 2015, @08:27PM (#153639)

                Quite. Most everything else I've tried uses the "enter terms then initiate search" interaction, which doesn't begin to compare for ease-of-use on a regular basis - I rarely even use a file browser anymore.

                Given the speed of search - apparently instant, even when the initial list shows a quarter-million entries and multiple word fragments are used (though admittedly by the time you've finalized the first fragment the list has already been reduced dramatically) - my first instinct would be that it uses an optimized version of a traditional sparse-matrix indexing scheme, with every file listed under all possible fragments (MyFile gets indexed under MyFile, yFile, File, ile, le, and e). But then indexing schemes were never really my forte. A grep-style scan over hundreds of thousands of entries should (I would think) take a decent fraction of a second on a slow machine, but I've never noticed any lag at all. Though presuming that each new character is only searched for within the results of the previous step would reduce that significantly after the first couple of characters are entered.

                Hmm, let's see if we can find some hints - on my current system it's listing 144,000 files, with a memory usage (under WINE, not sure how that might affect things) of 15.2MiB, and a database size of 2.0MB. So that's a maximum average of ~15 bytes per file in the database, and 108 in the live index, with only a fraction of a second required to build the index from the database (which, when opened in a hex editor, appears to be full of fragmentary file names interspersed with binary data). My guess would be it's using a variation of the traditional text index where MyFile gets indexed under MyFile, yFile, File, ile, le, and e, but I'm well outside my area of competency.
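                That guess - indexing every suffix of every name so that any substring query becomes a prefix lookup - can be sketched in a few lines. This is a toy illustration of the general technique, not Everything's actual data structure:

```python
from collections import defaultdict

def build_suffix_index(names):
    """Map every suffix of every name to the names containing it.
    Any substring of a name is a prefix of one of its suffixes,
    so substring search reduces to prefix matching over the index."""
    index = defaultdict(set)
    for name in names:
        low = name.lower()
        for i in range(len(low)):
            index[low[i:]].add(name)
    return index

def search(index, fragment):
    """Return every name whose suffix index contains the fragment."""
    frag = fragment.lower()
    hits = set()
    for suffix, names in index.items():
        if suffix.startswith(frag):
            hits |= names
    return hits

idx = build_suffix_index(["MyFile.txt", "notes.md", "profile.jpg"])
print(search(idx, "file"))  # matches both MyFile.txt and profile.jpg
```

                A real implementation would keep the suffixes in a sorted array or trie so each lookup is a binary search rather than a full scan, which would square with the "instant" feel even on large file counts.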

  • (Score: 2) by Runaway1956 on Wednesday March 04 2015, @09:04PM

    by Runaway1956 (2926) Subscriber Badge on Wednesday March 04 2015, @09:04PM (#153241) Homepage Journal

    What we need is, an AI capable of keeping track of all this shite. Ideally, the AI should be able to read our minds, as well as keeping track of all the odds and ends.

    I say, "Nelson, Jimmy sent me a memo last year . . . "
    As I trail off, Nelson answers, "You're thinking about those purchases of peripheral equipment, and you're trying to remember why you chose Brand A over Brands B, C, D, and Y. Here's that memo, complete with Jimmy's insinuation that he might get a kickback for buying brand A."

    With help like that, we wouldn't NEED software!

    --
    Our first six presidents were educated men. Then, along came a Democrat.
    • (Score: 0) by Anonymous Coward on Wednesday March 04 2015, @09:45PM

      by Anonymous Coward on Wednesday March 04 2015, @09:45PM (#153256)

      So you've seen the movie "Her" have you?

    • (Score: 2) by hendrikboom on Wednesday March 04 2015, @11:00PM

      by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:00PM (#153277) Homepage Journal

      That would be useful.

    • (Score: 0) by Anonymous Coward on Thursday March 05 2015, @09:39AM

      by Anonymous Coward on Thursday March 05 2015, @09:39AM (#153449)

      You start with

      What we need is, an AI capable of […]

      and finish with

      With help like that, we wouldn't NEED software!

      So that AI would not be software?

      • (Score: 2) by Runaway1956 on Thursday March 05 2015, @09:45AM

        by Runaway1956 (2926) Subscriber Badge on Thursday March 05 2015, @09:45AM (#153452) Homepage Journal

        Not in the sense that we are constantly looking for crap to install on our computers. The damned thing would come from the factory, ready to satisfy all of our needs and inquiries. We won't need questions like this posted to slashdot, because our "companion" has already answered all our questions.

        --
        Our first six presidents were educated men. Then, along came a Democrat.
        • (Score: 2) by Open4D on Thursday March 05 2015, @03:24PM

          by Open4D (371) Subscriber Badge on Thursday March 05 2015, @03:24PM (#153531) Journal

          s/slashdot/soylent/g

          :)

          • (Score: 2) by Runaway1956 on Thursday March 05 2015, @04:56PM

            by Runaway1956 (2926) Subscriber Badge on Thursday March 05 2015, @04:56PM (#153562) Homepage Journal

            *groan*

            Did I do that?

            --
            Our first six presidents were educated men. Then, along came a Democrat.
            • (Score: 2) by Yog-Yogguth on Tuesday March 10 2015, @01:50PM

              by Yog-Yogguth (1862) Subscriber Badge on Tuesday March 10 2015, @01:50PM (#155390) Journal

              Soylent News is more Slashdot than Slashdot ever was :)

              --
              Bite harder Ouroboros, bite! tails.boum.org/ linux USB CD secure desktop IRC *crypt tor (not endorsements (XKeyScore))
  • (Score: 4, Informative) by Nerdfest on Wednesday March 04 2015, @09:13PM

    by Nerdfest (80) on Wednesday March 04 2015, @09:13PM (#153244)

    Try Nuxeo or Alfresco. I've used Nuxeo and it's quite workable (follows standards, written in Java, cross platform). It should do everything you need.

  • (Score: 4, Funny) by wonkey_monkey on Wednesday March 04 2015, @09:47PM

    by wonkey_monkey (279) on Wednesday March 04 2015, @09:47PM (#153257) Homepage

    ...as soon as I can find it.

    --
    systemd is Roko's Basilisk
  • (Score: 0) by Anonymous Coward on Wednesday March 04 2015, @10:11PM

    by Anonymous Coward on Wednesday March 04 2015, @10:11PM (#153265)

    My skills are recall of retentive information stored in a retrieval system...

    ... Oh not that kind of looking, nor that kind of tool.

  • (Score: 1, Interesting) by Anonymous Coward on Thursday March 05 2015, @03:15AM

    by Anonymous Coward on Thursday March 05 2015, @03:15AM (#153359)

    I only know Windows and the NTFS filesystem. This filesystem indexes all the file names as metadata at a location of the hard disk known as the "Master File Table". There are tools available such as this one: Everything [voidtools.com] which will bypass the slowdown and bottleneck of the Windows API calls and directly read the file name information from the MFT. This allows blindingly fast search results on millions of files and terabyte-sized hard drives.

    I am curious to know if Unix/Linux has an equivalent to this. For comparison purposes, it would be a plus if you also have first-hand knowledge of the capabilities of "Everything" on NTFS volumes.

    • (Score: 1, Informative) by Anonymous Coward on Thursday March 05 2015, @09:43AM

      by Anonymous Coward on Thursday March 05 2015, @09:43AM (#153451)

      Since on Unix, the file name is not a property of the file, but merely of the directory entry pointing to the file (a single file can easily have several different names!) the same cannot exist on Unix/Linux. However there's the locate utility which keeps a separate database of file names (regularly updated via cron job), which effectively does the same job.
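      The locate idea can be shown in miniature: a cron job walks the tree once into a flat name database, and searches then hit the database instead of the disk. A toy sketch of both halves (not the real updatedb format, which is compressed):

```python
import os

def build_db(root, db_path):
    # The "updatedb" half: walk the tree once and record every path.
    with open(db_path, "w", encoding="utf-8") as db:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                db.write(os.path.join(dirpath, name) + "\n")

def locate(db_path, pattern):
    # The "locate" half: a linear scan of the flat database,
    # touching no directory on disk.
    with open(db_path, encoding="utf-8") as db:
        return [line.rstrip("\n") for line in db if pattern in line]
```

      The trade-off is exactly the one locate makes: results are as stale as the last index run, in exchange for searches that never wait on the filesystem.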

    • (Score: 1) by TLA on Thursday March 05 2015, @02:34PM

      by TLA (5128) on Thursday March 05 2015, @02:34PM (#153505) Journal

      I use that, too. Bloody brilliant bit of kit.

      --
      Excuse me, I think I need to reboot my horse. - NCommander
  • (Score: 2, Interesting) by TLA on Thursday March 05 2015, @07:15AM

    by TLA (5128) on Thursday March 05 2015, @07:15AM (#153429) Journal

    I can say that there is no single or simple solution. I use a multi-tiered solution which involves, among other things:
    - virtualisation of processing across multiple discrete processors (dynamic clustering, if you like);
    - a distributed filesystem layer (of my own design and implementation);
    - a wiki-based DBMS which holds all the static (and most of the dynamic) data, including search strings used in its own cache, while retaining the ability to store binaries such as Adobe .pdf and Microsoft .doc files in their raw form (i.e. as discrete objects on the filesystem rather than as object containers in the database);
    - about a dozen different fulltext search tools, including Agent Ransack, UltraFileSearch (contrary to how it sounds, it isn't a BHO; it's a desktop app that runs deep text searches across directory structures), ExamDiff (great for a line-by-line comparison of legal documents), and Acrobat Pro (the paid-for commercial version, not the reader), which can run fulltext searches and comparisons in pdf files across directory structures much as UFS does;
    - and Dragon Naturally Speaking, because I often find it easier to dictate documents than type them, particularly when others are trying to sleep and don't want to be hearing keyboards clacking away at 4am.

    There's more, but a lot of it falls under the category of Protected Trade Secrets.

    --
    Excuse me, I think I need to reboot my horse. - NCommander
    • (Score: 0) by Anonymous Coward on Thursday March 05 2015, @10:43AM

      by Anonymous Coward on Thursday March 05 2015, @10:43AM (#153462)

      Add KDiff3 to that list

      • (Score: 1) by TLA on Thursday March 05 2015, @02:41PM

        by TLA (5128) on Thursday March 05 2015, @02:41PM (#153510) Journal

        Thanks for that! I'll have a look at this and get back to you... already, it's good to see that it's ported across platforms!

        --
        Excuse me, I think I need to reboot my horse. - NCommander
  • (Score: 3, Interesting) by hendrikboom on Thursday March 05 2015, @07:56PM

    by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @07:56PM (#153624) Homepage Journal

    There are file formats that have a provision for metadata. No problem there. The metadata stays with the file when it is moved around, and edited. (Except of course that some software that does things to the file deletes the metadata. Bad tools. No cookie.)

    Even source code can bear comments that can contain metadata. But when made freely and publicly available, it may confuse some people who encounter it.

    But some file formats do not admit of metadata. Anyone have an idea how to attach metadata externally so it has a chance of staying attached? Perhaps a database somewhere with a way of fixing a file's identity? Perhaps something more creative?
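    One possible sketch of that database idea, keying tags on a content hash as the file's identity. This is a hypothetical illustration, not an existing tool: a hash survives moves and renames, but NOT edits, so an editing workflow would need an explicit re-keying step (the very problem the question raises):

```python
import hashlib
import sqlite3

def file_id(path):
    """Identify a file by its content hash: stable across moves and
    renames, but not across edits - editing changes the hash."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def open_tagdb(db_path):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS tags (file_id TEXT, tag TEXT)")
    return db

def add_tag(db, path, tag):
    db.execute("INSERT INTO tags VALUES (?, ?)", (file_id(path), tag))

def tags_for(db, path):
    rows = db.execute("SELECT tag FROM tags WHERE file_id = ?",
                      (file_id(path),))
    return {tag for (tag,) in rows}
```

    A fancier variant could store several weaker fingerprints per file (size, perceptual hash for images, fuzzy hash for text) so that small edits still find the old record - but that is exactly the unsolved part.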

  • (Score: 2) by darkfeline on Thursday March 05 2015, @08:57PM

    by darkfeline (1030) on Thursday March 05 2015, @08:57PM (#153649) Homepage

    Being somewhat OCD, this is a topic I've wasted a lot of time thinking about. I've been meaning to write a formal essay/thesis on the topic, but I'll just dump some of my notes here.

    Goal of file organization: to be able to find a given file in any arbitrary application in a universal manner as easily as possible.

    In practice, this means you'll have to do something on the file system level, perhaps with the help of a virtual file system.

    Fundamental requirements: each file must have at least one unique reference (in practice this is a file system path). The point is you must be able to unambiguously refer to a given file. An alternative might be using inode and device numbers, but good luck trying to get an implementation using that working.

    Optional requirements:

    Namespaces: Directories, in other words. They're useful, I guess, but see next point.

    Permanence: Unique refs (that is, paths) should not change. Changing refs = broken soft links (broken symlinks, broken resources, broken "recently used files"). Why would you change a unique ref?

    There are two types of refs: arbitrary and semantic.

    Semantic refs are refs that follow a given naming format, such as year/month/day/report number. These refs cannot be wrong if done correctly and will never need to be changed. Even if you migrate to a new naming format, the old naming format is still "correct" and can be preserved to maintain compatibility.

    Arbitrary refs are names you assign arbitrarily. For example, one photo of your cat you name cat.jpg, but another you name cute-kitten.jpg. These are bad, but sometimes necessary, as part of a semantic ref (for example, project names).

    Hierarchical organization: the files and folders that are the norm. The problem with this is when a file belongs in multiple places.

    Multi-dimensional hierarchy: A file can exist in multiple folders at once. You can do this using hard links on *nix. This is great, but keep in mind from above: semantic refs = good, arbitrary refs = bad.
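    The hard-link version of this is a few lines on *nix - the same file (same inode) visible under two hierarchies at once. Python here for illustration, though `ln` does the same from the shell:

```python
import os
import tempfile

# One file, reachable under two directory hierarchies at once.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "projects", "website"))
os.makedirs(os.path.join(root, "clients", "acme"))

original = os.path.join(root, "projects", "website", "logo.svg")
with open(original, "w") as f:
    f.write("<svg/>")

# A hard link is a second directory entry for the same inode, not a copy:
# edits through either name are seen through both.
alias = os.path.join(root, "clients", "acme", "logo.svg")
os.link(original, alias)

# Both names refer to the same underlying file.
print(os.stat(original).st_ino == os.stat(alias).st_ino)
```

    The caveat, as the parent notes for tags, is that both names are still refs you have to choose well - hard links multiply the places a bad name can live.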

    Tagging: Tagging is great, but tags must be used alongside unique refs, i.e. traditional files and folders. Why? Imagine searching using a query foo and opening a file bar. There's no guarantee that the file bar today will be the same file bar a week from now. This is bad. You NEED unique refs, ideally permanent unique refs.

    Implementation: Tagging should be done using a RESTful API on a virtual file system; otherwise, you can't use it universally across applications, and that makes it significantly less useful.

    Shameless plug: I've created a tool called Dantalian in my quest to find the perfect organization solution. You might find it interesting.

    https://github.com/darkfeline/dantalian [github.com]

    tl;dr summary: try not to think too hard about it. Do your best to organize files as they are created, try not to go around renaming crap because you will break soft links, and rely on some kind of file search when you really need to find something.

    I might be forgetting something, but those are the key points I've thought about so far.

    --
    Join the SDF Public Access UNIX System today!
    • (Score: 2) by hendrikboom on Saturday March 07 2015, @05:41AM

      by hendrikboom (1125) Subscriber Badge on Saturday March 07 2015, @05:41AM (#154047) Homepage Journal

      Dantalian? As in the mystical archives of Dantalian?

      • (Score: 2) by darkfeline on Saturday March 07 2015, @09:18PM

        by darkfeline (1030) on Saturday March 07 2015, @09:18PM (#154228) Homepage

        Yes, as in, "This will allow you to cultivate a library rivaling the mystic archives." Alas, it was not to be, but I've gotten useful experience out of it.

        --
        Join the SDF Public Access UNIX System today!