
SoylentNews is people

posted by janrinok on Wednesday March 04 2015, @07:27PM   Printer-friendly
from the over-to-you dept.

What free software is there in the way of organizing lots of documents?

To be more precise, the ones I *need* to organize are the files on hard drives, though if I could include documents I have elsewhere (bookshelves and photocopy files) I wouldn't mind. They are text documents in a variety of file formats and languages, source code for current and obsolete systems, jpeg images, film clips, drawings, SVG files, object code, shared libraries, fragments of drafts of books, ragged software documentation, works in progress ...

Of course the files are already semi-organized in directories, but I haven't yet managed to find a suitable collection of directory names. Hierarchical classification isn't ideal -- there are files that fit in several categories, and there are a lot of files that have to be in a particular location because of the way they are used (executables in a bin directory, for example) or the way they are updated or maintained. Taxonomists would advise setting up a controlled vocabulary of tags and attaching tags to the various files. I'd end up with a triple store or some other database describing files.
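A minimal sketch of what such a tag database could look like, using SQLite; the table and column names, and the sample paths and tags, are my own invention, not anything from an existing tool:

```python
import sqlite3

# Hypothetical schema: a controlled vocabulary of tags, and a
# many-to-many link table so one file can carry several tags.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (file_id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE tags  (tag_id  INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE file_tags (
        file_id INTEGER REFERENCES files,
        tag_id  INTEGER REFERENCES tags,
        PRIMARY KEY (file_id, tag_id)
    );
""")

def tag_file(path, *tag_names):
    # Record the file, then attach each tag, creating vocabulary
    # entries on first use.
    fid = conn.execute("INSERT INTO files (path) VALUES (?)", (path,)).lastrowid
    for name in tag_names:
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        tid = conn.execute("SELECT tag_id FROM tags WHERE name = ?",
                           (name,)).fetchone()[0]
        conn.execute("INSERT INTO file_tags VALUES (?, ?)", (fid, tid))

tag_file("/home/user/docs/draft1.tex", "book-draft", "latex")
tag_file("/home/user/src/parse.c", "source", "obsolete-system")

# Query: all files carrying the tag 'latex'.
rows = conn.execute("""
    SELECT f.path FROM files f
    JOIN file_tags ft ON ft.file_id = f.file_id
    JOIN tags t ON t.tag_id = ft.tag_id
    WHERE t.name = 'latex'
""").fetchall()
```

The same three-table shape works whether the backend is SQLite, a triple store, or flat text files; the hard part, as noted above, is what to use for `file_id`.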


But how to identify the files being tagged? A file-system pathname isn't enough. Files get moved, and sometimes entire directory trees full of files get moved from one place to another for various pragmatic reasons. And a hashcode isn't enough. Files get edited, upgraded, recompiled, reformatted, converted from JIS code to UTF-8, and so forth. Images get cropped and colour-corrected. And under these changes they should keep their assigned classification tags.
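One way to square that circle is to give each file a stable identity that is neither its path nor its hash: an opaque ID assigned once, with the path mapping maintained separately. A toy sketch of the idea (the class and method names are hypothetical, and a real version would persist the registry and hook into whatever moves the files):

```python
import uuid

# Hypothetical registry: a file's identity is a UUID assigned once.
# Moving or editing the file updates the registry, not the tags.
class FileRegistry:
    def __init__(self):
        self.id_by_path = {}   # current path -> stable ID
        self.tags_by_id = {}   # stable ID -> set of tags

    def register(self, path):
        fid = str(uuid.uuid4())
        self.id_by_path[path] = fid
        self.tags_by_id[fid] = set()
        return fid

    def tag(self, path, *tags):
        self.tags_by_id[self.id_by_path[path]].update(tags)

    def moved(self, old_path, new_path):
        # The file kept its identity; only its location changed.
        self.id_by_path[new_path] = self.id_by_path.pop(old_path)

    def tags_of(self, path):
        return self.tags_by_id[self.id_by_path[path]]

reg = FileRegistry()
reg.register("/data/photo.jpg")
reg.tag("/data/photo.jpg", "vacation", "1998")
reg.moved("/data/photo.jpg", "/archive/photo.jpg")
assert reg.tags_of("/archive/photo.jpg") == {"vacation", "1998"}
```

On Linux filesystems that support them, extended attributes are one place such an ID could ride along with the file through renames, though not through every copy or format conversion.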

Now a number of file formats can accommodate metadata. And some software that manipulates files can preserve that metadata and even allow the user to edit it. But most software doesn't.

Much of it could perhaps be done by automatic content analysis. Other material may require labour-intensive manual classification. Now I don't expect to see any off-the-shelf solution for all of this, but does anyone have ideas as to how to accomplish even some of this? Even poorly? Does anyone know of relevant practical tools? Or have ideas towards tools that *should* exist but currently don't? I'm ready to experiment.

 
  • (Score: 2) by Immerman on Thursday March 05 2015, @08:12PM

    by Immerman (3985) on Thursday March 05 2015, @08:12PM (#153632)

    I suspect it uses progressive winnowing - i.e. only searching the already reduced list as each new letter is added, which seems to run counter to Locate's interface at least - though I'll admit I never figured out how to get Locate to work as even a decent traditional search tool.

  • (Score: 2) by hendrikboom on Thursday March 12 2015, @07:31PM

    by hendrikboom (1125) Subscriber Badge on Thursday March 12 2015, @07:31PM (#156849) Homepage Journal

    Another possibility is to use a mix of lazy and eager evaluation. When the first character is typed, all you need to find is the small number of entries that will actually fit on the screen. You can go on generating the rest of the list while waiting for the next character. When the second character is typed, you go through the list of remaining entries (which may only be partially generated) until you have a new screenful of surviving entries. Then while waiting for the third character, you go on generating this new list, switching gears when you get to the end of the second list and proceeding with the first, and when it expires, going on with the original index.

    And so on.
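The mixed lazy/eager scheme described above can be sketched with Python generators; the names and the tiny "index" are illustrative, and a real incremental-search tool would interleave this with keystroke events:

```python
from itertools import islice, chain

SCREENFUL = 3  # rows visible at once (illustrative)

def winnow(candidates, query):
    """Lazily yield the entries matching the query so far."""
    return (name for name in candidates if query in name)

index = ["notes.txt", "notebook.md", "todo.txt", "photo.jpg", "phone.txt"]

# First keystroke: 't'. Force only a screenful; the generator holds
# the rest of the work for idle time between keystrokes.
survivors = winnow(index, "t")
screen = list(islice(survivors, SCREENFUL))

# Second keystroke: 'tx'. Re-winnow what was already produced, then
# fall through to the still-unconsumed remainder of the old pass.
survivors = winnow(chain(screen, survivors), "tx")
screen = list(islice(survivors, SCREENFUL))
```

Here the `chain` is doing the "switching gears" step: when the re-filtered screenful runs out, iteration continues transparently into the older, partially consumed list.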

    • (Score: 2) by Immerman on Friday March 13 2015, @01:11AM

      by Immerman (3985) on Friday March 13 2015, @01:11AM (#157075)

An excellent idea, but not, I think, the one it's using: the status bar immediately displays the number of matches, which wouldn't be possible with lazy evaluation. Though I would assume the actual UI list is generated on the fly; with as much data as is displayed for each file, the comprehensive list would require far more RAM than is actually being used.

      Still, once I thought about it, if your average filename is, say, 20 characters, and a 1GHz processor can compare 1 billion characters per second, then that's 50 million files that could be scanned in a second: my paltry 150,000 files would take less than 1/300th of that. So a simple exhaustive search could potentially do the job quite tidily, especially if you assume path names are searched separately and then cross-referenced with the files they contain.
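The back-of-the-envelope arithmetic above, written out (the 20-character average and one-comparison-per-cycle throughput are the post's own idealized assumptions):

```python
chars_per_name = 20        # assumed average filename length
chars_per_sec = 1e9        # ~1 GHz, one character comparison per cycle
files = 150_000

names_per_sec = chars_per_sec / chars_per_name   # 50 million files/second
scan_time = files / names_per_sec                # 0.003 s, i.e. under 1/300 s
```

So even a naive exhaustive scan of 150,000 names finishes in about three milliseconds, well inside the latency budget of a single keystroke.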