
posted by janrinok on Wednesday March 04 2015, @07:27PM
from the over-to-you dept.

What free software is there in the way of organizing lots of documents?

To be more precise, the ones I *need* to organize are the files on hard drives, though if I could include documents I have elsewhere (bookshelves and photocopy files) I wouldn't mind. They are text documents in a variety of file formats and languages, source code for current and obsolete systems, jpeg images, film clips, drawings, SVG files, object code, shared libraries, fragments of drafts of books, ragged software documentation, works in progress ...

Of course the files are already semi-organized in directories, but I haven't yet managed to find a suitable collection of directory names. Hierarchical classification isn't ideal -- there are files that fit in several categories, and there are a lot of files that have to be in a particular location because of the way they are used (executables in a bin directory, for example) or the way they are updated or maintained. Taxonomists would advise setting up a controlled vocabulary of tags and attaching tags to the various files. I'd end up with a triple store or some other database describing files.
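
As a toy illustration of what such a database might look like (a sketch only; SQLite is an assumption, and the table and column names are made up):

    import sqlite3

    con = sqlite3.connect("filetags.db")
    con.executescript("""
        CREATE TABLE IF NOT EXISTS tags (
            file_id TEXT,               -- a stable identifier, not a path
            tag     TEXT,
            PRIMARY KEY (file_id, tag)
        );
    """)

    def add_tags(file_id, *tags):
        con.executemany("INSERT OR IGNORE INTO tags VALUES (?, ?)",
                        [(file_id, t) for t in tags])
        con.commit()

    def files_tagged(tag):
        return [r[0] for r in
                con.execute("SELECT file_id FROM tags WHERE tag = ?", (tag,))]

The hard part is deciding what file_id should be, which is exactly the next problem.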


But how to identify the files being tagged? A file-system pathname isn't enough. Files get moved, and sometimes entire directory trees full of files get moved from one place to another for various pragmatic reasons. And a hashcode isn't enough. Files get edited, upgraded, recompiled, reformatted, converted from JIS code to UTF-8, and so forth. Images get cropped and colour-corrected. And under these changes they should keep their assigned classification tags.
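
One sketch of a way out (an assumption on my part, not an established tool): assign each file a permanent random identifier, and key all tags to that identifier rather than to a path or a hash. On Linux, extended attributes are one possible home for the identifier; they survive renames and moves within a filesystem, though not every copy, edit, or backup tool preserves them:

    import os, uuid

    ATTR = "user.fileid"   # hypothetical attribute name

    def ensure_id(path):
        """Return this file's permanent ID, assigning one if absent."""
        try:
            return os.getxattr(path, ATTR).decode()
        except OSError:                        # attribute not set yet
            fid = uuid.uuid4().hex
            os.setxattr(path, ATTR, fid.encode())
            return fid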

Now a number of file formats can accommodate metadata. And some software that manipulates files can preserve that metadata and even allow the user to edit it. But most software doesn't.
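
For formats and tools that won't carry metadata, the usual fallback is a sidecar file next to each document. A minimal sketch (the .meta.json naming convention is invented here):

    import json

    def write_tags(path, tags):
        # report.pdf gets a neighbour named report.pdf.meta.json
        with open(path + ".meta.json", "w") as f:
            json.dump({"tags": sorted(tags)}, f, indent=2)

    def read_tags(path):
        try:
            with open(path + ".meta.json") as f:
                return set(json.load(f)["tags"])
        except FileNotFoundError:
            return set()

The obvious cost is that the sidecar has to be moved and renamed in step with its file.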

Much of it could perhaps be done by automatic content analysis. Other material may require labour-intensive manual classification. Now I don't expect to see any off-the-shelf solution for all of this, but does anyone have ideas as to how to accomplish even some of this? Even poorly? Does anyone know of relevant practical tools? Or have ideas towards tools that *should* exist but currently don't? I'm ready to experiment.

 
  • (Score: 1, Disagree) by Anonymous Coward on Wednesday March 04 2015, @07:55PM

    by Anonymous Coward on Wednesday March 04 2015, @07:55PM (#153211)

    git

  • (Score: 3, Interesting) by VLM on Wednesday March 04 2015, @08:33PM

    by VLM (445) on Wednesday March 04 2015, @08:33PM (#153225)

    AC has it correct, stash your stuff in git to track changes.

    Something the OP didn't think about is the inevitable death of the platform, and your plan to pull the goodies out and reuse them somehow. So proprietary software would be insane. There's some pretty crazy dead stuff out there that was cool, once.

    The free version of GitLab (basically similar to GitHub) with the gitweb extension would do it.

    Something to think about before paying for something very advanced is that you might end up with what amounts to a git-backed wiki. That's all. No need for bug trackers and merge routines and hooks and deployment. Just a wiki on top of git.

    • (Score: 2) by buswolley on Wednesday March 04 2015, @08:39PM

      by buswolley (848) on Wednesday March 04 2015, @08:39PM (#153228)

      what about binary documents?

      --
      subicular junctures
      • (Score: 0) by Anonymous Coward on Wednesday March 04 2015, @09:14PM

        by Anonymous Coward on Wednesday March 04 2015, @09:14PM (#153245)

        Base64 and git

      • (Score: 2) by VLM on Wednesday March 04 2015, @09:18PM

        by VLM (445) on Wednesday March 04 2015, @09:18PM (#153247)

        What about em? disk space is cheap, labor is expensive, keep the old versions.

        • (Score: 2) by hendrikboom on Wednesday March 04 2015, @11:07PM

          by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:07PM (#153280) Homepage Journal

          OK. What's em?

          Yes, disk space is cheap. But it's not just a question of keeping old versions somewhere, it's a question of being able to *find* them when you've forgotten the file name.

          • (Score: 2) by buswolley on Thursday March 05 2015, @01:07AM

            by buswolley (848) on Thursday March 05 2015, @01:07AM (#153327)

            Also, I thought git (I don't use it) was used for versioning file content (e.g. line 67 changed), which can't be done the same way if the file is binary.

            --
            subicular junctures
            • (Score: 3, Informative) by khedoros on Thursday March 05 2015, @04:48AM

              by khedoros (2921) on Thursday March 05 2015, @04:48AM (#153391)
              Git handles binary files. Apparently you can even define diff methods for file types and for specific files. The example here [git-scm.com] uses a conversion utility to extract text from a docx file to do the diff. Even if that weren't possible, you could manually put a description of the change in when checking in the updated file.
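
              Roughly, the linked Pro Git example boils down to something like this (docx2txt is the converter the book happens to use):

                  # .gitattributes
                  *.docx diff=word

                  # then register the text-extraction command for that driver
                  git config diff.word.textconv docx2txt

              After that, git diff on a .docx shows changes in the extracted text rather than just "binary files differ".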
    • (Score: 3, Interesting) by hendrikboom on Wednesday March 04 2015, @11:13PM

      by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:13PM (#153284) Homepage Journal

      At the moment I use monotone for revision management. It too is good at keeping track of changes. But it's not changes I'm asking about.

      It's more like, "Didn't I write a function a few years ago that used LU-decomposition to predict disk usage? I think I might adapt it to my new image-editing project. Now what was it called... For that matter, what project was it part of?"

      -- hendrik

    • (Score: 4, Interesting) by hendrikboom on Wednesday March 04 2015, @11:28PM

      by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:28PM (#153291) Homepage Journal

      You are absolutely right about avoiding proprietary software. I still have some files dating back to the 60's. Any indexing system I use should be as long-lived as the files themselves.

      For example: I wrote a Lisp interpreter for the IBM 1620 in assembler. I still have it. I found it -- somewhere -- when I was looking for something else a few years ago. It has one known bug. I wasn't as good a programmer then as I am now. Someday I'd like to find and fix it. Not that it would be useful. It just bothers me. I sought the bug for a week or so back then, before I had to get back to studying algebraic topology.

      But where's the code now?

      • (Score: 2) by frojack on Thursday March 05 2015, @04:33AM

        by frojack (1554) on Thursday March 05 2015, @04:33AM (#153382) Journal

        You are absolutely right about avoiding proprietary software. I still have some files dating back to the 60's. Any indexing system I use should be as long-lived as the files themselves.

        Why?
        You still have the files. Re-index them when better software comes along.

        --
        No, you are mistaken. I've always had this sig.
        • (Score: 2) by hendrikboom on Thursday March 05 2015, @01:07PM

          by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @01:07PM (#153482) Homepage Journal

          If the indexing is fully automatic, yes, reindexing is a no-brainer.

          But if it's even partially manual, I'd like to be able to go on using the old index.

    • (Score: 1, Informative) by Anonymous Coward on Thursday March 05 2015, @04:36AM

      by Anonymous Coward on Thursday March 05 2015, @04:36AM (#153384)

      I'm not sure it's what you're looking for, but if you haven't heard of TiddlyWiki, you should probably take a look. I've been using it for a couple of years, and it's insanely flexible, very easy to use, and very good for incrementally organizing large heterogeneous data. I'm currently beginning to use it in a similar capacity (albeit probably at a smaller scale).

      It supports tagging, and has a lot of different deployment options. If I were working your problem in an incremental fashion, I might drop a mini-wiki into each folder, and use that wiki to organize that folder, and then go back and start linking all the mini-folder wikis together into a master wiki (I believe TiddlyWiki can include sub-wikis by reference).

      At this point I'm starting to get into the widget language for building in more customization, but I haven't yet dug into the underlying JavaScript, plugin language, or node.js.

      It has changed my life. I'm much more organized now. I've organized and *mostly* codified my whole approach to life.

      I have a few reservations about it, but every time I think about them, they pale in comparison to what I've been able to accomplish with it.

      Reservations:

      1) While designed to scale, it does have limits. I'd characterize this as dimensionality of scaling. If you only want to scale in one dimension, say recording a few attributes apiece for lots of files, it might work really well. If you want to scale in multiple dimensions, recording arbitrarily detailed attributes for lots of files, you will probably run into more complexity. Since most of my scaling issues are fairly one-dimensional, I can usually land on a pretty good approach.

      2) Uneven. While so many things are dead easy in TW, a few are surprisingly hard. The programmatic interface assumes a data model, and you have to become familiar with it to be effective. So, once you're a little ways off the beaten path, things start to get complex in a hurry. This might be mitigated a lot if you're already familiar with JavaScript or node.js. I'm not. The good news is that the beaten path is pretty well beaten down, and a lot of thought has gone into it.

      It's still a work in progress, but I don't really expect the unevenness to get much better in the next couple of years. While I continue to see a lot of improvements, I have reason to believe there are still some fundamental limitations in what the wiki syntax and the programming model can do. I hope these will be incrementally burned off, as the architecture is robust enough to allow for that, but I'm not sure that's where the development focus is right now.

      http://tiddlywiki.com/ [tiddlywiki.com]

      • (Score: 2) by hendrikboom on Thursday March 05 2015, @05:29PM

        by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @05:29PM (#153572) Homepage Journal

        I'm aware of tiddlywiki. As I understand it, the entire text corpus resides in the same file as the tiddlywiki code. Which raises the question: How to update the code when there's a new version? I suppose one legitimate answer is that you don't.

        Tiddlywiki may well be useful for things like collecting ideas and notes in the early stages of project planning. And those notes will be useful if I come back to the project years later for any reason.

        What I've started doing is writing a lot of text into my programming projects. This might very well fit into tiddlywikis. There's text about the project design. There's a speculative diary, in which I record my thoughts about where the project might be going -- what approaches there are to various problems, what tentative stopgaps might be used until I get to a real solution, and so forth. Some of those are plain text, some are hand-coded simple html, and some are html generated from asciidoc.

        They are linked through various directories. (Moving a directory is always awkward because of the html links between them.) At the higher levels, these are more project lists than implementation details.

        Attaching these to a search engine with content analysis might be an effective way of indexing at least this part of the file base. Is there some way of embedding hand-made content tags into the html or asciidoc so they stay together with the files? There probably is.
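
        One way that is known to work for hand-written HTML is the standard keywords meta element, and Asciidoctor accepts a matching document attribute that it emits as the same element; the tag values below are placeholders:

            <meta name="keywords" content="disk-usage, LU-decomposition, prediction">

            // AsciiDoc document header
            :keywords: disk-usage, LU-decomposition, prediction

        A content-analysis indexer could then be taught to give those fields extra weight.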

        Anyone know any handy libre content analysis software?

        -- hendrik

        • (Score: 0) by Anonymous Coward on Saturday March 07 2015, @09:22PM

          by Anonymous Coward on Saturday March 07 2015, @09:22PM (#154230)

          In TiddlyWiki, when there's a new version released, you can grab a copy of the new version file, and import your old project.

  • (Score: 3, Insightful) by FatPhil on Wednesday March 04 2015, @11:21PM

    by FatPhil (863) <reversethis-{if.fdsa} {ta} {tnelyos-cp}> on Wednesday March 04 2015, @11:21PM (#153288) Homepage
    At no point do I see the concept of version control being relevant at all.

    I recommend git for pretty much everything that requires version control, but it's a hammer to this screw.

    To the OP - learn how to organise a directory hierarchy, and symbolic links, so that you can access "a and b" things via both "a/b/" and "b/a", and if that's not good enough, then learn how to not store so much shit that you can't access it simply through knowledge of the two most important keywords.
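
    A concrete sketch of that trick (the directory names are made up):

        mkdir -p photos/travel travel
        ln -s ../photos/travel travel/photos

    Now photos/travel/ and travel/photos/ name the same directory, so either keyword order finds it.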
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 3, Interesting) by sigma on Thursday March 05 2015, @01:08AM

      by sigma (1225) on Thursday March 05 2015, @01:08AM (#153328)

      learn how to not store so much shit that you can't access it simply through knowledge of the two most important keywords.

      Sometimes that's not an option, and yep, git is wrong for this purpose. As I see it, OP has two paths to choose from: A system that allows for complex and versatile file/document management, or a system that minimises the need to manage the files.

      If managing the files is the goal, then a Content Management System (CMS) is the answer, and one open source CMS that should do the job is Alfresco - http://www.alfresco.com/. [alfresco.com] There may be others, but I know Alfresco will do everything OP asked for and more.

      The alternative path is to forget about directly managing the files, and to use an indexing search tool to make sense of them. My approach is to leave my files in relative chaos and let Apache's Solr/Lucene - http://lucene.apache.org/solr/ [apache.org] solve that problem. I'm pretty sure it would achieve OP's goal, though in a different way (and one possibly challenging for those with OCD tendencies).
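
      Even without standing up Solr, the index-don't-organise idea is easy to prototype; a toy pure-Python sketch (the root path and the tokenising rule are placeholders):

          import os, re
          from collections import defaultdict

          index = defaultdict(set)            # word -> files containing it

          for root, _dirs, files in os.walk("/home/me/files"):
              for name in files:
                  path = os.path.join(root, name)
                  try:
                      with open(path, errors="ignore") as f:
                          for word in re.findall(r"[a-z0-9]+", f.read().lower()):
                              index[word].add(path)
                  except OSError:
                      pass

          # every file mentioning both keywords, per the two-keyword test above
          print(index["lu"] & index["decomposition"])

      A real deployment would want incremental updates and binary-format extractors, which is exactly what Solr provides.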

      • (Score: 2) by FatPhil on Thursday March 05 2015, @08:04AM

        by FatPhil (863) <reversethis-{if.fdsa} {ta} {tnelyos-cp}> on Thursday March 05 2015, @08:04AM (#153443) Homepage
        Haha - my day job currently is to write an alternative to solr/lucene! Alas, I don't think it will ever be seen outside the context of email.
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves