posted by janrinok on Wednesday March 04 2015, @07:27PM   Printer-friendly
from the over-to-you dept.

What free software is there in the way of organizing lots of documents?

To be more precise, the ones I *need* to organize are the files on hard drives, though if I could include documents I have elsewhere (bookshelves and photocopy files) I wouldn't mind. They are text documents in a variety of file formats and languages, source code for current and obsolete systems, jpeg images, film clips, drawings, SVG files, object code, shared libraries, fragments of drafts of books, ragged software documentation, works in progress ...

Of course the files are already semi-organized in directories, but I haven't yet managed to find a suitable collection of directory names. Hierarchical classification isn't ideal -- there are files that fit in several categories, and there are a lot of files that have to be in a particular location because of the way they are used (executables in a bin directory, for example) or the way they are updated or maintained. Taxonomists would advise setting up a controlled vocabulary of tags and attaching tags to the various files. I'd end up with a triple store or some other database describing files.
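The taxonomist's approach boils down to something quite small. A minimal sketch, assuming SQLite as the backing store (the table layout and the `doc-001`-style identifiers are illustrative, not from any existing tool):

```python
# A toy triple store for file tags: (file_id, predicate, value) rows in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")  # use an on-disk path in practice
conn.execute("CREATE TABLE tags (file_id TEXT, predicate TEXT, value TEXT)")

def tag(file_id, predicate, value):
    """Attach one tag (a predicate/value pair) to a file identifier."""
    conn.execute("INSERT INTO tags VALUES (?, ?, ?)", (file_id, predicate, value))

def files_with(predicate, value):
    """All file identifiers carrying the given tag."""
    rows = conn.execute(
        "SELECT DISTINCT file_id FROM tags WHERE predicate = ? AND value = ?",
        (predicate, value))
    return sorted(r[0] for r in rows)

tag("doc-001", "topic", "compilers")
tag("doc-001", "language", "ja")
tag("doc-002", "topic", "compilers")
print(files_with("topic", "compilers"))  # -> ['doc-001', 'doc-002']
```

The hard part, as the next paragraphs argue, is not the database but deciding what `file_id` should be.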


But how to identify the files being tagged? A file-system pathname isn't enough. Files get moved, and sometimes entire directory trees full of files get moved from one place to another for various pragmatic reasons. And a hashcode isn't enough. Files get edited, upgraded, recompiled, reformatted, converted from JIS code to UTF-8, and so forth. Images get cropped and colour-corrected. And under these changes they should keep their assigned classification tags.
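One hedged workaround for the identity problem: mint each file a UUID once and keep it in a sidecar file alongside the document, so the identifier survives both moves (the sidecar travels with its directory tree) and edits (the id never depends on content). The `.id` suffix and the `file_id` helper here are illustrative assumptions, not an existing tool's convention:

```python
# Sketch: a stable per-file identity via a sidecar UUID file.
import tempfile
import uuid
from pathlib import Path

def file_id(path: Path) -> str:
    """Return the file's stable id, minting one on first use."""
    sidecar = path.with_name(path.name + ".id")
    if sidecar.exists():
        return sidecar.read_text().strip()
    new_id = str(uuid.uuid4())
    sidecar.write_text(new_id)
    return new_id

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "draft.txt"
    p.write_text("chapter one")
    ident = file_id(p)
    p.write_text("chapter one, revised")  # editing the content...
    assert file_id(p) == ident            # ...doesn't change the identity
```

Extended attributes (`os.setxattr` on Linux) could serve the same role without cluttering directories, at the cost of being silently dropped by filesystems and tools that don't support them.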

Now a number of file formats can accommodate metadata. And some software that manipulates files can preserve metadata and even allow user editing of it. But most software doesn't.

Much of it could perhaps be done by automatic content analysis. Other material may require labour-intensive manual classification. Now I don't expect to see any off-the-shelf solution for all of this, but does anyone have ideas as to how to accomplish even some of this? Even poorly? Does anyone know of relevant practical tools? Or have ideas towards tools that *should* exist but currently don't? I'm ready to experiment.

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 1) by No Respect on Wednesday March 04 2015, @08:48PM

    by No Respect (991) on Wednesday March 04 2015, @08:48PM (#153232)

    Sometimes it seems like a 2-D file hierarchy isn't enough. For instance, I get a file from my company president. It's about a project schedule. And it's important. And it has a date on it. What's a good way to save and/or catalog such a file so that it can be found when searching for it using any number of orthogonal attributes? There are probably solutions out there - document control is a big thing from what I hear - that use a backend database but that seems like overkill. Maybe not.

    My email needs are similar. Right now I'm setting up Thunderbird (it has to work with Windows) to interface with a gmail account using IMAP. It's not optimal because gmail has embraced and extended the IMAP standard. Slapping labels on everything is not the same as putting things in containers, but when faced with the need for a 5-dimensional array of "containers", it seems to work well enough. For now at least. I'm still not happy with it. A lot of emails come with attachments that can be detached from the email bodies, and then there's the same problem of how to organize those documents that the OP describes. Organization by simple filesystem hierarchy is inadequate for many needs.

  • (Score: 2) by Immerman on Thursday March 05 2015, @02:38AM

    by Immerman (3985) on Thursday March 05 2015, @02:38AM (#153348)

    Meta-tags are a decent solution. I append tags to the end of my file names, making them easy to locate with any filename search tool (of which Everything is far and away the best I've found - with nigh instant find-as-you-type). The folder hierarchy then becomes a secondary organization scheme, one that after 30 years I'm *still* working on finding good guidelines for, though it promises to be more useful as a secondary scheme than the primary one.
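The filename-tag scheme described above needs nothing more than substring matching over names, which is essentially what Everything's find-as-you-type does. A rough sketch (the example filenames are invented):

```python
# Filter filenames the way a filename-search tool does: a file matches
# when every typed fragment occurs somewhere in its name.
def matches(filename: str, fragments: list) -> bool:
    name = filename.lower()
    return all(f.lower() in name for f in fragments)

files = [
    "budget 2014 draft tax household.ods",
    "vacation photos italy 2013.zip",
    "compiler notes parsing lalr.txt",
]
print([f for f in files if matches(f, ["tax", "2014"])])
# -> ['budget 2014 draft tax household.ods']
```

Because tags live in the name itself, the same query works from Everything, `locate`, or a shell glob -- no separate database to keep in sync.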

    • (Score: 2) by hendrikboom on Thursday March 05 2015, @02:58AM

      by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @02:58AM (#153355) Homepage Journal

      I'll have to investigate Everything.

      And the good guidelines you seek -- have you at least found some not-so-good guidelines?

      • (Score: 2) by Immerman on Thursday March 05 2015, @04:50AM

        by Immerman (3985) on Thursday March 05 2015, @04:50AM (#153395)

        You mean beyond "old_desktop/old_desktop/..."? I wish. Some top-level folders that, in a personal setting, seem to have some staying power (typically stored on a separate partition accessible from whatever OSes I'm multibooting, without permission headaches, etc.):
        - An "Attic" or "Archive" folder holding photos, backups, etc. that I'll rarely modify, delete, or access, but want to keep at hand.
        - A "Library" folder for music, videos, ebooks, etc - stuff that I didn't create, am not going to modify, but may access frequently. This one is typically fairly easy to organize along physical library lines.
        - A "Data" or "Documents" folder, for stuff I create, linked from my various home folders, (which are otherwise empty except for configuration files, which I don't really want mixed in with *my* data) - basically all the stuff I should really be backing up on a regular basis. Probably the mst disorganized of the lot.
        - A "References" folder, containing quick reference sheets, etc. (often ends up somplace in "Data" with a top-level link)
        - An "Active project" folder, containing only links/shortcuts to the projects I'm currently working on, wherever they may be in my hierarchy

        That much has even managed to survive my shift to a search-based organization, where it serves as a way to further winnow my search results. As for the various subfolders, guiding principles, etc? Nothing that's stood the test of time. I'll give major projects their own folder as a "file grouping" convenience, but have yet to come up with any consistent guiding principles beyond that.

        One thing I notice as I get accustomed to searching though is that my hierarchy is beginning to flatten - there's no longer a significant "finding penalty" to having 1,000 files in the same folder, and a deeper folder hierarchy means less potential filename length to hold tags before running into path-length restrictions, as well as the usual inconsistent chaos that tends to plague it.

        I consider "find as you type" to be *absolutely* essential though, even searches that give "instant" results as soon as you hit [Enter] can't compare. It lets me see as I type how well my winnowing is performing thus far. Do I need to add a few more letters to the current word fragment to clarify? Add some more fragments? Or have I already gotten down to just a handful of files from which I can spot what I want at a glance?

        If you're on a 'nix though Everything comes with caveats - it'll run fine in WINE, but you have to configure the paths you want it to index, and to periodically update the index, since it can't exploit the NTFS journal for continuous monitoring (on Windows it defaults to indexing all local NTFS drives). And you'll want to configure the context menu so that "open file" and "open folder" both execute
            $exec(winebrowser "%1")
        and "open path" executes
            $exec(winebrowser "$pathpart(%1)")
        That'll get you the core functionality at least. I'm not happy with the WINE-induced path-name inconsistencies, but haven't found anything native to compare.

        • (Score: 2) by Immerman on Thursday March 05 2015, @04:53AM

          by Immerman (3985) on Thursday March 05 2015, @04:53AM (#153397)

          Oh, also - I'm still on the fence about tags, but I'm getting better about using long descriptive filenames, which work almost as well since Everything doesn't actually distinguish between tags and the rest of the filename.

          • (Score: 2) by hendrikboom on Thursday March 05 2015, @07:04PM

            by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @07:04PM (#153600) Homepage Journal

            The really distinctive feature of Everything would appear to be the interaction mechanism. That could be rewritten. I mentioned using it with the locate database. Ideally, this interactive search could be applied to indexes other than a list of files... then other code could prepare indexes of, say, the metadata in photos, or tags in HTML documents, and they could all be searched at once.

            I wonder how the index is organised. Perhaps differently from the locate database, which I suspect is just a compressed list of all files, to be scanned with something like grep. Does it slow down when you search for two words? Or is the set of found things small enough after the first word that a complete in-memory scan of the residue so far becomes feasible?
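            The "residue" strategy conjectured here is easy to sketch: scan the full list once for the first fragment, then restrict every later scan to the survivors. The file list below is invented for illustration:

```python
# Incremental narrowing: each new fragment is searched only within the
# results left over from the previous fragment.
def narrow(candidates, fragment):
    frag = fragment.lower()
    return [name for name in candidates if frag in name.lower()]

index = ["README.md", "Makefile", "main.c", "main_test.c", "notes/main.txt"]
residue = narrow(index, "main")    # first word: full scan of the index
residue = narrow(residue, "test")  # second word: scan only the residue
print(residue)  # -> ['main_test.c']
```

After the first fragment the residue is typically tiny, so even a naive linear scan per keystroke stays well under perceptible latency.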

            • (Score: 2) by Immerman on Thursday March 05 2015, @08:27PM

              by Immerman (3985) on Thursday March 05 2015, @08:27PM (#153639)

              Quite. Most everything else I've tried uses the "enter terms then initiate search" interaction, which doesn't begin to compare for ease-of-use on a regular basis - I rarely even use a file browser anymore.

              Given the speed of search - apparently instant, even when the initial list shows a quarter-million entries and multiple word fragments are used (though admittedly by the time you've finalized the first fragment the list has already been reduced dramatically) - my first instinct would be that it uses an optimized version of a traditional sparse-matrix indexing scheme, with every file listed under all possible fragments (MyFile gets indexed under MyFile, yFile, File, ile, le, and e). But then indexing schemes were never really my forte. A grep-style scan over hundreds of thousands of entries should (I would think) take a decent fraction of a second on a slow machine, but I've never noticed any lag at all. Though presuming that each new character is only searched for within the results of the previous step would reduce that significantly after the first couple of characters are entered.

              Hmm, let's see if we can find some hints - on my current system it's listing 144,000 files, with a memory usage (under WINE, not sure how that might affect things) of 15.2 MiB, and a database size of 2.0 MB. So that's a maximum average of ~15 bytes per file in the database, and ~108 in the live index, with only a fraction of a second required to build the index from the database (which when opened in a hex editor appears to be full of fragmentary file names interspersed with binary data). My guess would be it's using a variation of the traditional text index where MyFile gets indexed under MyFile, yFile, File, ile, le, and e, but I'm well outside my area of competency.
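              The conjectured scheme is essentially a suffix index: list every name under each of its suffixes, and a substring query becomes a prefix match against those suffixes. A toy version, purely to illustrate the guess (how Everything actually stores its index is not documented here):

```python
# Toy suffix index: every filename is listed under each suffix of its
# lowercased form, so "file" finds MyFile via the suffix "file".
from collections import defaultdict

def build_index(names):
    index = defaultdict(set)
    for name in names:
        low = name.lower()
        for i in range(len(low)):
            index[low[i:]].add(name)  # MyFile -> myfile, yfile, file, ile, le, e
    return index

def search(index, fragment):
    frag = fragment.lower()
    # Naive scan over suffixes; a real index would binary-search sorted keys.
    return {n for suffix, names in index.items()
            if suffix.startswith(frag) for n in names}

idx = build_index(["MyFile", "Profile", "Makefile"])
print(sorted(search(idx, "file")))  # -> ['Makefile', 'MyFile', 'Profile']
```

The space cost is roughly quadratic in name length, which is why real implementations compress shared fragments - consistent with the fragmentary file names visible in the hex dump above.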