
SoylentNews is people

posted by janrinok on Wednesday March 04 2015, @07:27PM
from the over-to-you dept.

What free software is there in the way of organizing lots of documents?

To be more precise, the ones I *need* to organize are the files on hard drives, though if I could include documents I have elsewhere (bookshelves and photocopy files) I wouldn't mind. They are text documents in a variety of file formats and languages, source code for current and obsolete systems, jpeg images, film clips, drawings, SVG files, files, object code, shared libraries, fragments of drafts of books, ragged software documentation, works in progress ...

Of course the files are already semi-organized in directories, but I haven't yet managed to find a suitable collection of directory names. Hierarchical classification isn't ideal -- there are files that fit in several categories, and there are a lot of files that have to be in a particular location because of the way they are used (executables in a bin directory, for example) or the way they are updated or maintained. Taxonomists would advise setting up a controlled vocabulary of tags and attaching tags to the various files. I'd end up with a triple store or some other database describing files.
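To make the tagging idea concrete, here is a minimal sketch of a tag store kept as (subject, predicate, object) triples in SQLite. Everything in it is an assumption for illustration: the table layout, the function names `open_store`, `add_tag` and `files_with_all_tags`, and the idea that each file is referred to by some opaque identifier string. It is not a recommendation of any particular tool.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create (or open) a tiny triple store: (subject, predicate, object)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        " subject TEXT NOT NULL,"
        " predicate TEXT NOT NULL,"
        " object TEXT NOT NULL,"
        " UNIQUE (subject, predicate, object))"
    )
    return db


def add_tag(db, file_id, tag, vocabulary):
    """Attach `tag` to `file_id`, refusing tags outside the controlled vocabulary."""
    if tag not in vocabulary:
        raise ValueError(f"unknown tag: {tag!r}")
    # The UNIQUE constraint plus OR IGNORE makes repeated tagging harmless.
    db.execute("INSERT OR IGNORE INTO triples VALUES (?, 'tag', ?)", (file_id, tag))
    db.commit()


def files_with_all_tags(db, tags):
    """Return the file ids that carry every tag in `tags` (an AND query)."""
    tags = sorted(set(tags))
    if not tags:
        return []
    marks = ",".join("?" for _ in tags)
    rows = db.execute(
        "SELECT subject FROM triples"
        " WHERE predicate = 'tag' AND object IN (" + marks + ")"
        " GROUP BY subject HAVING COUNT(DISTINCT object) = ?"
        " ORDER BY subject",
        (*tags, len(tags)),
    )
    return [row[0] for row in rows]
```

A file that fits several categories simply gets several rows, which is the point of tags over a single directory path. Other predicates (author, language, format) could be stored in the same table.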


But how to identify the files being tagged? A file-system pathname isn't enough. Files get moved, and sometimes entire directory trees full of files get moved from one place to another for various pragmatic reasons. And a hashcode isn't enough. Files get edited, upgraded, recompiled, reformatted, converted from JIS code to UTF-8, and so forth. Images get cropped and colour-corrected. And under these changes they should keep their assigned classification tags.
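One partial approach to the identity problem can be sketched as follows: give every file an arbitrary ID, remember both its last known path and its last known content hash, and rescan periodically. A file found at a known path keeps its ID even if its content changed (an edit); a file with a known hash at a new path keeps its ID too (a move). The registry layout and the names `sha256_of` and `reconcile` are assumptions made up for this sketch. It loses track of a file that is moved and edited between two scans, and it treats a file replaced by unrelated content at the same path as an edit.

```python
import hashlib
import os
import uuid


def sha256_of(path):
    """Hash a file's content in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reconcile(registry, root):
    """Update registry {file_id: {"path": ..., "hash": ...}} from files under root."""
    seen = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            seen[path] = sha256_of(path)

    by_path = {rec["path"]: fid for fid, rec in registry.items()}
    unmatched = dict(seen)

    # Pass 1: same path as before, so treat it as edited in place (or unchanged).
    for path, digest in seen.items():
        fid = by_path.get(path)
        if fid is not None:
            registry[fid]["hash"] = digest
            del unmatched[path]

    # Pass 2: records whose old path vanished are matched by content hash (moved).
    missing = {rec["hash"]: fid for fid, rec in registry.items()
               if rec["path"] not in seen}
    for path, digest in list(unmatched.items()):
        fid = missing.pop(digest, None)
        if fid is not None:
            registry[fid]["path"] = path
            del unmatched[path]

    # Pass 3: anything still unmatched is new and gets a fresh ID.
    for path, digest in unmatched.items():
        registry[str(uuid.uuid4())] = {"path": path, "hash": digest}
    return registry
```

Tags would then be attached to the ID rather than to the path or the hash, so they survive either kind of change as long as scans happen often enough.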

Now a number of file formats can accommodate metadata. And some software that manipulates files can preserve metadata and even allow user editing of the metadata. But more of it doesn't.

Much of it could perhaps be done by automatic content analysis. Other material may require labour-intensive manual classification. Now I don't expect to see any off-the-shelf solution for all of this, but does anyone have ideas as to how to accomplish even some of this? Even poorly? Does anyone know of relevant practical tools? Or have ideas towards tools that *should* exist but currently don't? I'm ready to experiment.

 
  • (Score: 3, Insightful) by FatPhil (863) <pc-soylentNO@SPAMasdf.fi> on Wednesday March 04 2015, @11:21PM (#153288)

    At no point do I see the concept of version control being relevant at all.

    I recommend git for pretty much everything that requires version control, but it's a hammer to this screw.

    To the OP - learn how to organise a directory hierarchy, and symbolic links, so that you can access "a and b" things via both "a/b/" and "b/a", and if that's not good enough, then learn how to not store so much shit that you can't access it simply through knowledge of the two most important keywords.
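The symbolic-link layout described above could be scripted roughly like this. It is a sketch under assumptions: the function name `link_under_tags` and the idea of a separate "view" directory holding only links are invented for illustration, and it needs a filesystem that supports symlinks. The file itself stays wherever it has to live; only links are created.

```python
import itertools
import os


def link_under_tags(view_root, target, tags):
    """Link `target` under every ordering of `tags`, e.g. a/b/name and b/a/name."""
    target = os.path.abspath(target)
    links = []
    for order in itertools.permutations(sorted(set(tags))):
        directory = os.path.join(view_root, *order)
        os.makedirs(directory, exist_ok=True)
        link = os.path.join(directory, os.path.basename(target))
        # lexists is true for an existing link even if it is dangling.
        if not os.path.lexists(link):
            os.symlink(target, link)
        links.append(link)
    return links
```

The number of links grows as the factorial of the number of keywords, which is one practical reason to stick to the two most important ones.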
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
  • (Score: 3, Interesting) by sigma (1225) on Thursday March 05 2015, @01:08AM (#153328)

    learn how to not store so much shit that you can't access it simply through knowledge of the two most important keywords.

    Sometimes that's not an option, and yep, git is wrong for this purpose. As I see it, OP has two paths to choose from: A system that allows for complex and versatile file/document management, or a system that minimises the need to manage the files.

    If managing the files is the goal, then a Content Management System (CMS) is the answer, and one open source CMS that should do the job is Alfresco - http://www.alfresco.com/. There may be others, but I know Alfresco will do everything OP asked for and more.

    The alternative path is to forget about directly managing the files, and to use an indexing search tool to make sense of it. My approach is to leave my files in relative chaos and let Apache's Solr/Lucene - http://lucene.apache.org/solr/ - solve that problem. I'm pretty sure it would achieve OP's goal, though in a different way (and possibly a challenging one for those with OCD tendencies).
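For a feel of what the index-and-search route does, here is a toy inverted index in plain Python. It is emphatically not Solr or Lucene and uses none of their APIs; the names `build_index` and `search` are made up, it only reads files as text, and it has no ranking, stemming or format-specific extraction. It just shows the basic idea: map each word to the set of files containing it, then intersect those sets for a query.

```python
import os
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def build_index(root):
    """Map each lowercase word to the set of file paths under `root` containing it."""
    index = defaultdict(set)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                # Undecodable bytes are dropped, so binary files add little or nothing.
                with open(path, encoding="utf-8", errors="ignore") as f:
                    text = f.read()
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            for word in WORD.findall(text.lower()):
                index[word].add(path)
    return index


def search(index, query):
    """Return the sorted paths that contain every word of `query`."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

With this approach the files can stay where they are; the index is rebuilt or updated as they change, and nothing depends on the directory names being well chosen.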

    • (Score: 2) by FatPhil (863) <pc-soylentNO@SPAMasdf.fi> on Thursday March 05 2015, @08:04AM (#153443)

      Haha - my day job currently is to write an alternative to solr/lucene! Alas, I don't think it will ever be seen outside the context of email.
      --
      Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves