
posted by janrinok on Wednesday March 04 2015, @07:27PM
from the over-to-you dept.

What free software is there in the way of organizing lots of documents?

To be more precise, the ones I *need* to organize are the files on hard drives, though if I could include documents I have elsewhere (bookshelves and photocopy files) I wouldn't mind. They are text documents in a variety of file formats and languages, source code for current and obsolete systems, jpeg images, film clips, drawings, SVG files, object code, shared libraries, fragments of drafts of books, ragged software documentation, works in progress ...

Of course the files are already semi-organized in directories, but I haven't yet managed to find a suitable collection of directory names. Hierarchical classification isn't ideal -- there are files that fit in several categories, and there are a lot of files that have to be in a particular location because of the way they are used (executables in a bin directory, for example) or the way they are updated or maintained. Taxonomists would advise setting up a controlled vocabulary of tags and attaching tags to the various files. I'd end up with a triple store or some other database describing files.
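
As a toy illustration of what such a database might look like (a sketch only; SQLite is an assumption, and the table and column names are made up):

    import sqlite3

    con = sqlite3.connect("filetags.db")
    con.executescript("""
        CREATE TABLE IF NOT EXISTS tags (
            file_id TEXT,               -- a stable identifier, not a path
            tag     TEXT,
            PRIMARY KEY (file_id, tag)
        );
    """)

    def add_tags(file_id, *tags):
        con.executemany("INSERT OR IGNORE INTO tags VALUES (?, ?)",
                        [(file_id, t) for t in tags])
        con.commit()

    def files_tagged(tag):
        return [r[0] for r in
                con.execute("SELECT file_id FROM tags WHERE tag = ?", (tag,))]

The hard part is deciding what file_id should be, which is exactly the next problem.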


But how to identify the files being tagged? A file-system pathname isn't enough. Files get moved, and sometimes entire directory trees full of files get moved from one place to another for various pragmatic reasons. And a hashcode isn't enough. Files get edited, upgraded, recompiled, reformatted, converted from JIS code to UTF-8, and so forth. Images get cropped and colour-corrected. And under these changes they should keep their assigned classification tags.
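
One sketch of a way out (an assumption on my part, not an established tool): assign each file a permanent random identifier, and key all tags to that identifier rather than to a path or a hash. On Linux, extended attributes are one possible home for the identifier; they survive renames and moves within a filesystem, though not every copy, edit, or backup tool preserves them:

    import os, uuid

    ATTR = "user.fileid"   # hypothetical attribute name

    def ensure_id(path):
        """Return this file's permanent ID, assigning one if absent."""
        try:
            return os.getxattr(path, ATTR).decode()
        except OSError:                        # attribute not set yet
            fid = uuid.uuid4().hex
            os.setxattr(path, ATTR, fid.encode())
            return fid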

Now a number of file formats can accommodate metadata. And some software that manipulates files can preserve that metadata and even allow the user to edit it. But most software doesn't.
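
For formats and tools that won't carry metadata, the usual fallback is a sidecar file next to each document. A minimal sketch (the .meta.json naming convention is invented here):

    import json

    def write_tags(path, tags):
        # report.pdf gets a neighbour named report.pdf.meta.json
        with open(path + ".meta.json", "w") as f:
            json.dump({"tags": sorted(tags)}, f, indent=2)

    def read_tags(path):
        try:
            with open(path + ".meta.json") as f:
                return set(json.load(f)["tags"])
        except FileNotFoundError:
            return set()

The obvious cost is that the sidecar has to be moved and renamed in step with its file.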

Much of it could perhaps be done by automatic content analysis. Other material may require labour-intensive manual classification. Now I don't expect to see any off-the-shelf solution for all of this, but does anyone have ideas as to how to accomplish even some of this? Even poorly? Does anyone know of relevant practical tools? Or have ideas towards tools that *should* exist but currently don't? I'm ready to experiment.

 
  • (Score: 1, Disagree) by Anonymous Coward on Wednesday March 04 2015, @07:55PM

    by Anonymous Coward on Wednesday March 04 2015, @07:55PM (#153211)

    git

  • (Score: 3, Interesting) by VLM on Wednesday March 04 2015, @08:33PM

    by VLM (445) on Wednesday March 04 2015, @08:33PM (#153225)

    AC has it correct, stash your stuff in git to track changes.

    Something the OP didn't think about is the inevitable death of the platform, and your plan to pull the goodies out and reuse them somehow. So proprietary software would be insane. There's some pretty crazy dead stuff out there that was cool, once.

    The free version of GitLab (basically similar to GitHub) with the gitweb extension would do it.

    Something to think about before paying for something very advanced is that you might end up with what amounts to a git-backed wiki. That's all. No need for bug trackers and merge routines and hooks and deployment. Just a wiki on top of git.

    • (Score: 2) by buswolley on Wednesday March 04 2015, @08:39PM

      by buswolley (848) on Wednesday March 04 2015, @08:39PM (#153228)

      what about binary documents?

      --
      subicular junctures
      • (Score: 0) by Anonymous Coward on Wednesday March 04 2015, @09:14PM

        by Anonymous Coward on Wednesday March 04 2015, @09:14PM (#153245)

        Base64 and git

      • (Score: 2) by VLM on Wednesday March 04 2015, @09:18PM

        by VLM (445) on Wednesday March 04 2015, @09:18PM (#153247)

        What about em? disk space is cheap, labor is expensive, keep the old versions.

        • (Score: 2) by hendrikboom on Wednesday March 04 2015, @11:07PM

          by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:07PM (#153280) Homepage Journal

          OK. What's em?

          Yes, disk space is cheap. But it's not just a question of keeping old versions somewhere, it's a question of being able to *find* them when you've forgotten the file name.

          • (Score: 2) by buswolley on Thursday March 05 2015, @01:07AM

            by buswolley (848) on Thursday March 05 2015, @01:07AM (#153327)

            Also, I thought git (I don't use it) was used for versioning file content (e.g. line 67 changed), which can't be done the same way if the file is binary.

            --
            subicular junctures
            • (Score: 3, Informative) by khedoros on Thursday March 05 2015, @04:48AM

              by khedoros (2921) on Thursday March 05 2015, @04:48AM (#153391)
              Git handles binary files. Apparently you can even define diff methods for file types and for specific files. The example here [git-scm.com] uses a conversion utility to extract text from a docx file to do the diff. Even if that weren't possible, you could manually put a description of the change in when checking in the updated file.
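
              Roughly, the linked Pro Git example boils down to something like this (docx2txt is the converter the book happens to use):

                  # .gitattributes
                  *.docx diff=word

                  # then register the text-extraction command for that driver
                  git config diff.word.textconv docx2txt

              After that, git diff on a .docx shows changes in the extracted text rather than just "binary files differ".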
    • (Score: 3, Interesting) by hendrikboom on Wednesday March 04 2015, @11:13PM

      by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:13PM (#153284) Homepage Journal

      At the moment I use monotone for revision management. It too is good at keeping track of changes. But it's not changes I'm asking about.

      It's more like, "Didn't I write a function a few years ago that used LU-decomposition to predict disk usage? I think I might adapt it to my new image-editing project. Now what was it called... For that matter, what project was it part of?"

      -- hendrik

    • (Score: 4, Interesting) by hendrikboom on Wednesday March 04 2015, @11:28PM

      by hendrikboom (1125) Subscriber Badge on Wednesday March 04 2015, @11:28PM (#153291) Homepage Journal

      You are absolutely right about avoiding proprietary software. I still have some files dating back to the 60's. Any indexing system I use should be as long-lived as the files themselves.

      For example: I wrote a Lisp interpreter for the IBM 1620 in assembler. I still have it. I found it -- somewhere -- when I was looking for something else a few years ago. It has one known bug. I wasn't as good a programmer then as I am now. Someday I'd like to find and fix it. Not that it would be useful. It just bothers me. I sought the bug for a week or so back then, before I had to get back to studying algebraic topology.

      But where's the code now?

      • (Score: 2) by frojack on Thursday March 05 2015, @04:33AM

        by frojack (1554) on Thursday March 05 2015, @04:33AM (#153382) Journal

        You are absolutely right about avoiding proprietary software. I still have some files dating back to the 60's. Any indexing system I use should be as long-lived as the files themselves.

        Why?
        You still have the files. Re-index them when better software comes along.

        --
        No, you are mistaken. I've always had this sig.
        • (Score: 2) by hendrikboom on Thursday March 05 2015, @01:07PM

          by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @01:07PM (#153482) Homepage Journal

          If the indexing is fully automatic, yes, reindexing is a no-brainer.

          But if it's even partially manual, I'd like to be able to go on using the old index.

    • (Score: 1, Informative) by Anonymous Coward on Thursday March 05 2015, @04:36AM

      by Anonymous Coward on Thursday March 05 2015, @04:36AM (#153384)

      I'm not sure it's what you're looking for, but if you haven't heard of TiddlyWiki, you should probably take a look. I've been using it for a couple of years, and it's insanely flexible, very easy to use, and very good for incrementally organizing large heterogeneous data. I'm currently beginning to use it in a similar capacity (albeit probably at a smaller scale).

      It supports tagging, and has a lot of different deployment options. If I were working your problem in an incremental fashion, I might drop a mini-wiki into each folder, and use that wiki to organize that folder, and then go back and start linking all the mini-folder wikis together into a master wiki (I believe TiddlyWiki can include sub-wikis by reference).

      At this point I'm starting to get into the widget language for building in more customization, but I haven't yet dug into the underlying JavaScript, plugin language, or node.js.

      It has changed my life. I'm much more organized now. I've organized and *mostly* codified my whole approach to life.

      I have a few reservations about it, but every time I think about them, they pale in comparison to what I've been able to accomplish with it.

      Reservations:

      1) While designed to scale, it does have limits. I'd characterize this as dimensionality of scaling. If you only want to scale in one dimension, say recording a few attributes apiece for lots of files, it might work really well. If you want to scale in multiple dimensions, recording arbitrarily detailed attributes for lots of files, you will probably run into more complexity. Since most of my scaling issues are fairly one-dimensional, I can usually land on a pretty good approach.

      2) Uneven. While so many things are dead easy in TW, a few are surprisingly hard. The programmatic interface assumes a data model, and you have to become familiar with it to be effective. So, once you're a little ways off the beaten path, things start to get complex in a hurry. This might be mitigated a lot if you're already familiar with JavaScript or node.js. I'm not. The good news is that the beaten path is pretty well beaten down, and a lot of thought has gone into it.

      It's still a work in progress, but I don't really expect the unevenness to get much better in the next couple of years. While I continue to see a lot of improvements, I have reason to believe there are still some fundamental limitations in what the wiki syntax and the programming model can do. I hope these will be incrementally burned off, as the architecture is robust enough to allow for that, but I'm not sure that's where the development focus is right now.

      http://tiddlywiki.com/ [tiddlywiki.com]

      • (Score: 2) by hendrikboom on Thursday March 05 2015, @05:29PM

        by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @05:29PM (#153572) Homepage Journal

        I'm aware of tiddlywiki. As I understand it, the entire text corpus resides in the same file as the tiddlywiki code. Which raises the question: How to update the code when there's a new version? I suppose one legitimate answer is that you don't.

        Tiddlywiki may well be useful for things like collecting ideas and notes in the early stages of project planning. And those notes will be useful if I come back to the project years later for any reason.

        What I've started doing is writing a lot of text into my programming projects. This might very well fit into tiddlywikis. There's text about the project design. There's a speculative diary, in which I record my thoughts about where the project might be going -- what approaches there are to various problems, what tentative stopgaps might be used until I get to a real solution, and so forth. Some of those are plain text, some are hand-coded simple html, and some are html generated from asciidoc.

        They are linked through various directories. (Moving a directory is always awkward because of the html links between them.) At the higher levels, these are more project lists than implementation details.

        Attaching these to a search engine with content analysis might be an effective way of indexing at least this part of the file base. Is there some way of embedding hand-made content tags into the html or asciidoc so they stay together with the files? There probably is.
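
        One way that is known to work for hand-written HTML is the standard keywords meta element, and Asciidoctor accepts a matching document attribute that it emits as the same element; the tag values below are placeholders:

            <meta name="keywords" content="disk-usage, LU-decomposition, prediction">

            // AsciiDoc document header
            :keywords: disk-usage, LU-decomposition, prediction

        A content-analysis indexer could then be taught to give those fields extra weight.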

        Anyone know any handy libre content analysis software?

        -- hendrik

        • (Score: 0) by Anonymous Coward on Saturday March 07 2015, @09:22PM

          by Anonymous Coward on Saturday March 07 2015, @09:22PM (#154230)

          In TiddlyWiki, when there's a new version released, you can grab a copy of the new version file, and import your old project.

  • (Score: 3, Insightful) by FatPhil on Wednesday March 04 2015, @11:21PM

    by FatPhil (863) <reversethis-{if.fdsa} {ta} {tnelyos-cp}> on Wednesday March 04 2015, @11:21PM (#153288) Homepage
    At no point do I see the concept of version control being relevant at all.

    I recommend git for pretty much everything that requires version control, but it's a hammer to this screw.

    To the OP - learn how to organise a directory hierarchy, and symbolic links, so that you can access "a and b" things via both "a/b/" and "b/a", and if that's not good enough, then learn how to not store so much shit that you can't access it simply through knowledge of the two most important keywords.
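
    A concrete sketch of that trick (the directory names are made up):

        mkdir -p photos/travel travel
        ln -s ../photos/travel travel/photos

    Now photos/travel/ and travel/photos/ name the same directory, so either keyword order finds it.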
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 3, Interesting) by sigma on Thursday March 05 2015, @01:08AM

      by sigma (1225) on Thursday March 05 2015, @01:08AM (#153328)

      learn how to not store so much shit that you can't access it simply through knowledge of the two most important keywords.

      Sometimes that's not an option, and yep, git is wrong for this purpose. As I see it, OP has two paths to choose from: A system that allows for complex and versatile file/document management, or a system that minimises the need to manage the files.

      If managing the files is the goal, then a Content Management System (CMS) is the answer, and one open source CMS that should do the job is Alfresco - http://www.alfresco.com/. [alfresco.com] There may be others, but I know Alfresco will do everything OP asked for and more.

      The alternative path is to forget about directly managing the files, and to use an indexing search tool to make sense of them. My approach is to leave my files in relative chaos and let Apache's Solr/Lucene - http://lucene.apache.org/solr/ [apache.org] solve that problem. I'm pretty sure it would achieve OP's goal, though in a different way (and one possibly challenging for those with OCD tendencies).
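
      Even without standing up Solr, the index-don't-organise idea is easy to prototype; a toy pure-Python sketch (the root path and the tokenising rule are placeholders):

          import os, re
          from collections import defaultdict

          index = defaultdict(set)            # word -> files containing it

          for root, _dirs, files in os.walk("/home/me/files"):
              for name in files:
                  path = os.path.join(root, name)
                  try:
                      with open(path, errors="ignore") as f:
                          for word in re.findall(r"[a-z0-9]+", f.read().lower()):
                              index[word].add(path)
                  except OSError:
                      pass

          # every file mentioning both keywords, per the two-keyword test above
          print(index["lu"] & index["decomposition"])

      A real deployment would want incremental updates and binary-format extractors, which is exactly what Solr provides.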

      • (Score: 2) by FatPhil on Thursday March 05 2015, @08:04AM

        by FatPhil (863) <reversethis-{if.fdsa} {ta} {tnelyos-cp}> on Thursday March 05 2015, @08:04AM (#153443) Homepage
        Haha - my day job currently is to write an alternative to solr/lucene! Alas, I don't think it will ever be seen outside the context of email.
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves