
SoylentNews is people

posted by janrinok on Wednesday March 04 2015, @07:27PM
from the over-to-you dept.

What free software is there in the way of organizing lots of documents?

To be more precise, the ones I *need* to organize are the files on hard drives, though if I could include documents I have elsewhere (bookshelves and photocopy files) I wouldn't mind. They are text documents in a variety of file formats and languages, source code for current and obsolete systems, JPEG images, film clips, drawings, SVG files, object code, shared libraries, fragments of drafts of books, ragged software documentation, works in progress ...

Of course the files are already semi-organized in directories, but I haven't yet managed to find a suitable collection of directory names. Hierarchical classification isn't ideal -- there are files that fit in several categories, and there are a lot of files that have to be in a particular location because of the way they are used (executables in a bin directory, for example) or the way they are updated or maintained. Taxonomists would advise setting up a controlled vocabulary of tags and attaching tags to the various files. I'd end up with a triple store or some other database describing files.
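
For illustration only, here is a minimal sketch of what such a triple store could look like, using SQLite from Python's standard library. The table layout, the function names, and the "file:0001" style of identifier are all assumptions made for the example; how to assign a stable identifier to a file is exactly the open question raised in the next paragraph.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny triple store of (subject, predicate, object) rows."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        " subject TEXT NOT NULL,"
        " predicate TEXT NOT NULL,"
        " object TEXT NOT NULL,"
        " UNIQUE (subject, predicate, object))"
    )
    return db


def add_triple(db, subject, predicate, obj):
    """Record one statement about a file; exact duplicates are silently ignored."""
    db.execute(
        "INSERT OR IGNORE INTO triples (subject, predicate, object) VALUES (?, ?, ?)",
        (subject, predicate, obj),
    )
    db.commit()


def subjects_with(db, predicate, obj):
    """Return every subject that carries the given predicate/object pair, sorted."""
    rows = db.execute(
        "SELECT subject FROM triples WHERE predicate = ? AND object = ? ORDER BY subject",
        (predicate, obj),
    )
    return [row[0] for row in rows]
```

With this layout a file can carry any number of tags, so add_triple(db, "file:0002", "tag", "draft") and add_triple(db, "file:0002", "tag", "photo") can coexist, and subjects_with(db, "tag", "draft") lists everything tagged as a draft regardless of where it lives on disk.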


But how to identify the files being tagged? A file-system pathname isn't enough. Files get moved, and sometimes entire directory trees full of files get moved from one place to another for various pragmatic reasons. And a hashcode isn't enough. Files get edited, upgraded, recompiled, reformatted, converted from JIS code to UTF-8, and so forth. Images get cropped and colour-corrected. And under these changes they should keep their assigned classification tags.

Now a number of file formats can accommodate metadata. And some software that manipulates files can preserve metadata and even allow user editing of the metadata. But more software doesn't.

Much of it could perhaps be done by automatic content analysis. Other material may require labour-intensive manual classification. Now I don't expect to see any off-the-shelf solution for all of this, but does anyone have ideas as to how to accomplish even some of this? Even poorly? Does anyone know of relevant practical tools? Or have ideas towards tools that *should* exist but currently don't? I'm ready to experiment.

 
This discussion has been archived. No new comments can be posted.
  • (Score: 2) by darkfeline on Thursday March 05 2015, @08:57PM


    Being somewhat OCD, this is a topic I've wasted a lot of time thinking about. I've been meaning to write a formal essay/thesis on the topic, but I'll just dump some of my notes here.

    Goal of file organization: to be able to find a given file in any arbitrary application in a universal manner as easily as possible.

    In practice, this means you'll have to do something on the file system level, perhaps with the help of a virtual file system.

    Fundamental requirements: each file must have at least one unique reference (in practice this is a file system path). The point is you must be able to unambiguously refer to a given file. An alternative might be using inode and device numbers, but good luck trying to get an implementation using that working.
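
    As a rough sketch of the inode-and-device alternative (not something Dantalian or any particular tool is claimed to do), Python's os.stat exposes both numbers. The helper name file_key is made up for this example.

```python
import os


def file_key(path):
    """Return (device number, inode number) for the file at path.

    The pair names the underlying file rather than its path, so it stays
    the same when the file is renamed or moved within one file system.
    """
    st = os.stat(path)
    return (st.st_dev, st.st_ino)
```

    The catch is that the key is only stable in limited cases: copying a file to another file system gives it a new pair, and so does any program that saves by writing a fresh file and renaming it over the old one. That is a large part of why an implementation built on it is hard to get working.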

    Optional requirements:

    Namespaces: Directories, in other words. They're useful, I guess, but see next point.

    Permanence: Unique refs (that is, paths) should not change. Changing refs = broken soft links (broken symlinks, broken resources, broken "recently used files"). Why would you change a unique ref?

    There are two types of refs: arbitrary and semantic.

    Semantic refs are refs that follow a given naming format, such as year/month/day/report number. These refs cannot be wrong if done correctly and will never need to be changed. Even if you migrate to a new naming format, the old naming format is still "correct" and can be preserved to maintain compatibility.

    Arbitrary refs are names you assign arbitrarily. For example, one photo of your cat you name cat.jpg, but another you name cute-kitten.jpg. These are bad, but sometimes necessary as part of a semantic ref (for example, project names).
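
    To make the semantic case concrete, here is a small sketch that builds a ref in the year/month/day/report number format mentioned above. The "reports" root, the "report-" prefix, and the zero padding are assumptions chosen for the example, not a recommended standard.

```python
from datetime import date
from pathlib import PurePosixPath


def semantic_ref(day, report_number, root="reports"):
    """Build a year/month/day/report-number ref from facts about the document.

    The path is derived entirely from the date and the report number, so
    the same inputs always give the same ref and nothing is left to whim.
    """
    return PurePosixPath(
        root,
        f"{day.year:04d}",
        f"{day.month:02d}",
        f"{day.day:02d}",
        f"report-{report_number:03d}",
    )
```

    For instance, semantic_ref(date(2015, 3, 5), 7) gives reports/2015/03/05/report-007. Because the name is computed rather than invented, there is never a reason to rename it later.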

    Hierarchical organization: the files and folders that are the norm. The problem with this arises when a file belongs in multiple places.

    Multi-dimensional hierarchy: A file can exist in multiple folders at once. You can do this using hard links on *nix. This is great, but keep in mind from above, semantic refs = good, arbitrary refs = bad.
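
    A minimal sketch of the hard-link approach, using os.link from Python's standard library. The helper name add_to_folder is hypothetical, and the sketch assumes both folders sit on the same file system, since hard links cannot cross file systems.

```python
import os


def add_to_folder(existing_path, folder):
    """Make the file at existing_path also appear in folder via a hard link.

    Both names then refer to the same data on disk: editing it through one
    name changes what is seen through the other, and the data is only
    freed once every name has been removed.
    """
    os.makedirs(folder, exist_ok=True)
    new_path = os.path.join(folder, os.path.basename(existing_path))
    os.link(existing_path, new_path)  # raises FileExistsError if new_path exists
    return new_path
```

    Afterwards os.path.samefile(existing_path, new_path) is true. One thing to watch for: a program that saves by writing a new file and renaming it over the old name quietly breaks the link, leaving the other folder with the stale copy.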

    Tagging: Tagging is great, but tags must be used alongside unique refs, i.e. traditional files and folders. Why? Imagine searching using a query foo and opening a file bar. There's no guarantee that the file bar you get today will be the same file bar a week from now. This is bad. You NEED unique refs, ideally permanent unique refs.

    Implementation: Tagging should be done using a RESTful API on a virtual file system; otherwise, you can't use it universally across applications, and that makes it significantly less useful.

    Shameless plug: I've created a tool called Dantalian in my quest to find the perfect organization solution. You might find it interesting.

    https://github.com/darkfeline/dantalian

    tl;dr summary: try not to think too hard about it. Do your best to organize files as they are created, try not to go around renaming crap because you will break soft links, and rely on some kind of file search when you really need to find something.

    I might be forgetting something, but those are the key points I've thought about so far.

    --
    Join the SDF Public Access UNIX System today!
  • (Score: 2) by hendrikboom on Saturday March 07 2015, @05:41AM


    Dantalian? As in the mystical archives of Dantalian?

    • (Score: 2) by darkfeline on Saturday March 07 2015, @09:18PM


      Yes, as in, "This will allow you to cultivate a library rivaling the mystic archives." Alas, it was not to be, but I've gotten useful experience out of it.
