
SoylentNews is people

posted by janrinok on Wednesday March 04 2015, @07:27PM
from the over-to-you dept.

What free software is there in the way of organizing lots of documents?

To be more precise, the ones I *need* to organize are the files on hard drives, though if I could include documents I have elsewhere (bookshelves and photocopy files) I wouldn't mind. They are text documents in a variety of file formats and languages, source code for current and obsolete systems, jpeg images, film clips, drawings, SVG files, files, object code, shared libraries, fragments of drafts of books, ragged software documentation, works in progress ...

Of course the files are already semi-organized in directories, but I haven't yet managed to find a suitable collection of directory names. Hierarchical classification isn't ideal -- there are files that fit in several categories, and there are a lot of files that have to be in a particular location because of the way they are used (executables in a bin directory, for example) or the way they are updated or maintained. Taxonomists would advise setting up a controlled vocabulary of tags and attaching tags to the various files. I'd end up with a triple store or some other database describing files.
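
To make the triple-store idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the function names and the "file:1" style identifiers are all assumptions made up for illustration; a real RDF store would look different.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a tiny store of (subject, predicate, object) rows."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        " subject TEXT NOT NULL, predicate TEXT NOT NULL, object TEXT NOT NULL,"
        " UNIQUE (subject, predicate, object))"
    )
    return db

def add_triple(db, subject, predicate, obj):
    # INSERT OR IGNORE makes tagging the same file twice harmless.
    db.execute("INSERT OR IGNORE INTO triples VALUES (?, ?, ?)",
               (subject, predicate, obj))

def subjects_with(db, predicate, obj):
    """All subjects that carry the given predicate/object pair, sorted."""
    rows = db.execute(
        "SELECT subject FROM triples WHERE predicate = ? AND object = ?"
        " ORDER BY subject",
        (predicate, obj),
    )
    return [row[0] for row in rows]

def subjects_with_all(db, predicate, objs):
    """Subjects that carry every one of the given objects (e.g. all tags)."""
    sets = [set(subjects_with(db, predicate, o)) for o in objs]
    return sorted(set.intersection(*sets)) if sets else []

# Hypothetical file identifiers and tags, for illustration only.
db = open_store()
add_triple(db, "file:1", "tag", "draft")
add_triple(db, "file:1", "tag", "book")
add_triple(db, "file:2", "tag", "draft")
add_triple(db, "file:2", "tag", "source-code")
add_triple(db, "file:1", "tag", "draft")  # duplicate, silently ignored

drafts = subjects_with(db, "tag", "draft")
book_drafts = subjects_with_all(db, "tag", ["draft", "book"])
```

Because a file can carry any number of tags, the "fits in several categories" problem of a directory hierarchy goes away; the open question is what to use as the subject.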


But how to identify the files being tagged? A file-system pathname isn't enough. Files get moved, and sometimes entire directory trees full of files get moved from one place to another for various pragmatic reasons. And a hashcode isn't enough. Files get edited, upgraded, recompiled, reformatted, converted from JIS code to UTF-8, and so forth. Images get cropped and colour-corrected. And under these changes they should keep their assigned classification tags.
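
One partial approach, sketched below in Python under some loud assumptions: give each file a random ID, remember its last known path and content hash, and on each rescan treat "same path, new hash" as an edit and "path gone, same hash elsewhere" as a move. The index layout and the function names are hypothetical, and a file that is both moved and edited between two scans would not be recognised.

```python
import hashlib
import os
import tempfile
import uuid

def sha256_of(path):
    """Content hash of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def scan(root):
    """Map every regular file under root to its content hash."""
    found = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            found[p] = sha256_of(p)
    return found

def reconcile(index, root):
    """Update index {file_id: {"path": ..., "hash": ...}} against the disk."""
    on_disk = scan(root)
    known_paths = {rec["path"] for rec in index.values()}
    unclaimed = {p: h for p, h in on_disk.items() if p not in known_paths}
    for rec in index.values():
        if rec["path"] in on_disk:
            # Same path: an edit in place (or no change); the ID is kept.
            rec["hash"] = on_disk[rec["path"]]
            continue
        # Path vanished: look for an unclaimed file with identical content.
        match = next((p for p, h in unclaimed.items() if h == rec["hash"]), None)
        if match is not None:
            rec["path"] = match          # a move; the ID is kept
            del unclaimed[match]
        # Otherwise the record is left alone and the file counts as missing.
    for p, h in unclaimed.items():
        index[str(uuid.uuid4())] = {"path": p, "hash": h}  # genuinely new file
    return index

# Small self-check in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    a = os.path.join(root, "notes.txt")
    with open(a, "w") as f:
        f.write("first draft")
    index = reconcile({}, root)
    (fid,) = index                       # exactly one file so far

    with open(a, "w") as f:
        f.write("second draft")          # edit in place
    reconcile(index, root)
    ids_after_edit = list(index)

    os.mkdir(os.path.join(root, "archive"))
    b = os.path.join(root, "archive", "notes.txt")
    os.rename(a, b)                      # move without editing
    reconcile(index, root)
    ids_after_move = list(index)
    path_after_move = index[fid]["path"]
```

Tags would then hang off the random ID rather than the path or the hash. This only works if the scan runs often enough that moves and edits rarely land in the same interval.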

Now a number of file formats can accommodate metadata. And some software that manipulates files can preserve metadata and even allow user editing of the metadata. But more doesn't.

Much of it could perhaps be done by automatic content analysis. Other material may require labour-intensive manual classification. Now I don't expect to see any off-the-shelf solution for all of this, but does anyone have ideas as to how to accomplish even some of this? Even poorly? Does anyone know of relevant practical tools? Or have ideas towards tools that *should* exist but currently don't? I'm ready to experiment.
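
As a very crude illustration of automatic content analysis, the Python sketch below suggests tags from a controlled vocabulary by counting trigger words in a text. The vocabulary, the trigger words and the threshold are invented for the example, and the tokeniser only understands unaccented Latin letters, so it would do nothing useful for Japanese text or for binary files.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: tag -> words that hint at it.
VOCABULARY = {
    "source-code": {"function", "compile", "variable"},
    "book-draft": {"chapter", "manuscript", "draft"},
}

def suggest_tags(text, vocabulary=VOCABULARY, min_hits=2):
    """Return the tags whose trigger words occur at least min_hits times."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        tag: sum(words[w] for w in triggers)
        for tag, triggers in vocabulary.items()
    }
    return sorted(tag for tag, score in scores.items() if score >= min_hits)

suggested = suggest_tags("Chapter 3 of the manuscript: this draft needs work.")
too_weak = suggest_tags("compile it")
```

Suggestions like these would still need a human to confirm them, which is where the labour-intensive part comes back in.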

 
  • (Score: 1, Informative) by Anonymous Coward on Thursday March 05 2015, @04:36AM (#153384)

    I'm not sure it's what you're looking for, but if you haven't heard about TiddlyWiki, you should probably take a look. I've been using it for a couple years, and it's insanely flexible, very easy to use, and is very good for incrementally organizing large heterogeneous data. I'm currently beginning to use it in a similar capacity (albeit, probably smaller scale).

    It supports tagging, and has a lot of different deployment options. If I were working your problem in an incremental fashion, I might drop a mini-wiki into each folder, and use that wiki to organize that folder, and then go back and start linking all the mini-folder wikis together into a master wiki (I believe TiddlyWiki can include sub-wikis by reference).

    At the point I'm at, I'm starting to get into the widget language for building in more customization, but I haven't yet dug into the underlying JavaScript, plugin language or node.js.

    It has changed my life. I'm much more organized now. I've organized and *mostly* codified my whole approach to life.

    I have a few reservations about it, but every time I think about them, they pale in comparison to what I've been able to accomplish with it.

    Reservations:

    1) While designed to scale, it does have limits. I guess I would characterize this as dimensionality of scaling. If you only want to scale in one dimension, say recording a few attributes for a number of files, but for lots of files, it might work really well. If you want to scale in multiple dimensions, recording arbitrarily detailed attributes, and for lots of files, you will probably run into more complexity. Since most of my scaling issues are fairly one-dimensional, I can usually align a pretty good approach.

    2) Uneven. While so many things are dead easy in TW, a few are surprisingly hard. The programmatic interface assumes a data model, and you have to become familiar with it to be effective. So, once you're a little ways off the beaten path, things start to get complex in a hurry. This might be mitigated a lot if you're already familiar with JavaScript or node.js. I'm not. The good news is that the beaten path is pretty well beaten down, and a lot of thought has gone into it.

    It's still a work in progress, but I don't really expect the unevenness to get much better in the next couple years. While I continue to see a lot of improvements, I have reason to believe that there are still some fundamental limitations in place with regard to what the wiki syntax can do and the programming model. I hope that these will be incrementally burned off, as the architecture is robust enough to allow for that, but I'm not sure if that's where the development focus is right now.

    http://tiddlywiki.com/

  • (Score: 2) by hendrikboom (1125) on Thursday March 05 2015, @05:29PM (#153572)

    I'm aware of tiddlywiki. As I understand it, the entire text corpus resides in the same file as the tiddlywiki code. Which raises the question: How to update the code when there's a new version? I suppose one legitimate answer is that you don't.

    Tiddlywiki may well be useful for things like collecting ideas and notes in the early stages of project planning. And those notes will be useful if I come back to the project years later for any reason.

    What I've started doing is writing a lot of text into my programming projects. This might very well fit into tiddlywikis. There's text about the project design. There's a speculative diary, in which I record my thoughts about where the project might be going -- what approaches there are to various problems, what tentative stopgaps might be used until I get to a real solution, and so forth. Some of those are plain text, some are hand-coded simple html, and some are html generated from asciidoc.

    They are linked through various directories. (Moving a directory is always awkward because of the html links between them.) At the higher levels, these are more project lists than implementation details.

    Attaching these to a search engine with content analysis might be an effective way of indexing at least this part of the file base. Is there some way of embedding hand-made content tags into the html or asciidoc so they stay together with the files? There probably is.
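
    For the hand-coded html, one candidate is the standard keywords meta element in the document head. A minimal Python sketch for reading it back, using only the standard library, follows; the class name, function name and sample page are made up for illustration. AsciiDoc has a `:keywords:` header attribute which, as far as I know, its HTML output turns into the same meta element, but check that against your own toolchain.

```python
from html.parser import HTMLParser

class KeywordReader(HTMLParser):
    """Collect the comma-separated values of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "keywords":
            return
        content = attributes.get("content") or ""
        self.keywords += [k.strip() for k in content.split(",") if k.strip()]

def read_keywords(html_text):
    reader = KeywordReader()
    reader.feed(html_text)
    return reader.keywords

# Hypothetical project page carrying hand-made tags.
SAMPLE = (
    "<html><head><title>Design notes</title>"
    '<meta name="keywords" content="design, diary, parser">'
    "</head><body><p>Notes.</p></body></html>"
)
tags = read_keywords(SAMPLE)
```

    Tags stored this way travel with the file when it is moved or edited, as long as whatever regenerates the html preserves the head.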

    Anyone know any handy libre content analysis software?

    -- hendrik

    • (Score: 0) by Anonymous Coward on Saturday March 07 2015, @09:22PM (#154230)

      In TiddlyWiki, when there's a new version released, you can grab a copy of the new version file, and import your old project.