What free software is there in the way of organizing lots of documents?
To be more precise, the ones I *need* to organize are the files on hard drives, though if I could include documents I have elsewhere (bookshelves and photocopy files) I wouldn't mind. They are text documents in a variety of file formats and languages, source code for current and obsolete systems, jpeg images, film clips, drawings, SVG files, object code, shared libraries, fragments of drafts of books, ragged software documentation, works in progress ...
Of course the files are already semi-organized in directories, but I haven't yet managed to find a suitable collection of directory names. Hierarchical classification isn't ideal -- there are files that fit in several categories, and there are a lot of files that have to be in a particular location because of the way they are used (executables in a bin directory, for example) or the way they are updated or maintained. Taxonomists would advise setting up a controlled vocabulary of tags and attaching tags to the various files. I'd end up with a triples store or some other database describing files.
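To make the tag-database idea concrete, here's a minimal sketch using SQLite. The schema and names (`files`, `tags`, `user.doc` conventions, etc.) are my own invention for illustration, not any existing tool's format -- the point is just that tags attach to a stable file id rather than to a pathname:

```python
import sqlite3

def open_tagstore(path=":memory:"):
    """Create or open a tiny tag database. Files get stable ids;
    tags attach to ids rather than to pathnames, so a rename
    only touches the files table, not the tags."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS files(
            id     INTEGER PRIMARY KEY,
            path   TEXT NOT NULL,   -- current location; may go stale
            sha256 TEXT             -- last known content hash; may go stale
        );
        CREATE TABLE IF NOT EXISTS tags(
            file_id INTEGER REFERENCES files(id),
            tag     TEXT NOT NULL,
            UNIQUE(file_id, tag)
        );
    """)
    return db

def tag(db, file_id, label):
    # INSERT OR IGNORE makes repeated tagging harmless
    db.execute("INSERT OR IGNORE INTO tags VALUES (?, ?)", (file_id, label))

def files_with(db, label):
    rows = db.execute(
        "SELECT f.path FROM files f JOIN tags t ON t.file_id = f.id "
        "WHERE t.tag = ?", (label,))
    return [r[0] for r in rows]
```

A real triples store would generalize the `tags` table to arbitrary (file, attribute, value) rows, but for plain classification tags a two-table layout like this is enough.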
But how to identify the files being tagged? A file-system pathname isn't enough. Files get moved, and sometimes entire directory trees full of files get moved from one place to another for various pragmatic reasons. And a hashcode isn't enough. Files get edited, upgraded, recompiled, reformatted, converted from JIS code to UTF-8, and so forth. Images get cropped and colour-corrected. And under these changes they should keep their assigned classification tags.
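One cheap heuristic for surviving *either* a move *or* an edit (though not both at once) is to record both the path and a content hash at tagging time, then re-resolve later: an unchanged hash finds a moved file, an unchanged path finds an edited one. A sketch, with the record format and function names made up for illustration:

```python
import hashlib

def sha256_of(path):
    """Content hash of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def relocate(record, candidates):
    """record: {'path': ..., 'sha256': ...} as last registered.
    candidates: {path: sha256} for files currently on disk.
    A moved-but-unedited file is found by hash; a file edited
    in place is found by path; if both changed, give up and
    leave it for manual review."""
    for path, digest in candidates.items():
        if digest == record["sha256"]:
            return path           # moved, content unchanged
    if record["path"] in candidates:
        return record["path"]     # edited in place
    return None                   # moved *and* edited: manual review
```

This obviously breaks exactly in the hard case described above -- a file that was both moved and re-encoded -- so it's a partial answer at best; that residue is where content analysis or manual intervention would have to take over.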
Now a number of file formats can accommodate metadata. And some software that manipulates files can preserve metadata and even allow user editing of the metadata. But most doesn't.
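For formats that can't carry metadata themselves, one workaround is to hang it off the filesystem: Linux extended attributes where the filesystem supports them, with a sidecar file as a fallback. A sketch -- the attribute name `user.doc.tags` and the `.tags` sidecar suffix are my own conventions, and note that xattrs are easily lost when files are copied across filesystems or through tools that don't preserve them:

```python
import json
import os

def write_tags(path, tags):
    """Attach tags to a file: try a user.* extended attribute
    (Linux only, filesystem permitting), else fall back to a
    sidecar file next to the document."""
    data = json.dumps(sorted(tags)).encode()
    try:
        os.setxattr(path, "user.doc.tags", data)
    except (OSError, AttributeError):   # unsupported fs, or non-Linux
        with open(path + ".tags", "wb") as f:
            f.write(data)

def read_tags(path):
    """Read tags back from the xattr, else the sidecar,
    else report the file as untagged."""
    try:
        return set(json.loads(os.getxattr(path, "user.doc.tags")))
    except (OSError, AttributeError):
        try:
            with open(path + ".tags", "rb") as f:
                return set(json.loads(f.read()))
        except FileNotFoundError:
            return set()
```

Either way the tags travel with the file under renames within a filesystem, which is exactly the case where pathname-keyed databases go stale.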
Much of it could perhaps be done by automatic content analysis. Other material may require labour-intensive manual classification. Now I don't expect to see any off-the-shelf solution for all of this, but does anyone have ideas as to how to accomplish even some of this? Even poorly? Does anyone know of relevant practical tools? Or have ideas towards tools that *should* exist but currently don't? I'm ready to experiment.
(Score: 4, Insightful) by frojack on Wednesday March 04 2015, @07:58PM
Ah, the curse of terabyte drives.
First some fill in the blanks questions:
Who (how many) are going to be using this? Just you? A small office of people?
Where are all the files? Internal drive, NAS? Across a local area network?
And, (drum roll) What OS does it have to reside on (not the documents, but the software)?
All of these matter a great deal.
No, you are mistaken. I've always had this sig.
(Score: 2) by hendrikboom on Wednesday March 04 2015, @08:20PM
Who (how many) are going to be using this? Just you? A small office of people?
Just me. But I do hope that if anything useful comes of it, it'll be useful to others as well.
Where are all the files? Internal drive, NAS? Across a local area network?
Most are on a server that's currently running Debian Wheezy. It provides both NFS and sshfs. Some are on a laptop, which usually runs Linux. A very few on a Windows partition on that laptop. I have control over the server and the Linux part of the laptop. And there are multiple (but not perfectly up-to-date) copies on backup drives, which are normally *not* attached to the machine.
And, (drum roll) What OS does it have to reside on (not the documents, but the software)?
Something that can access the server, if not the server itself.
(Score: 2) by frojack on Wednesday March 04 2015, @11:26PM
Someone recommended recoll to me. http://software.opensuse.org/package/recoll [opensuse.org]
Everything I know about it at this point I learned from that link. It mentions no database, which to me suggests it's going to crawl your disk for every search, which, if true, is not acceptable.
(Score: 2) by hendrikboom on Wednesday March 04 2015, @11:42PM
recoll appears to be a front end to xapian, which is an interesting tool in itself. The recoll page you linked to says it doesn't use a database, but xapian does. And the xapian page says it can handle databases larger than 2G, which is essential for large document collections. So I do suppose it uses a database and doesn't scan the entire file system for every query.
(Score: 2) by frojack on Thursday March 05 2015, @04:27AM
I understand (from the suse mailing list) that it has a database, but does not need a database server running all the time. It doesn't store the documents, it just indexes them in a database of its own making. You can schedule scans, or use the real-time inode monitoring feature, which is pretty new:
http://www.lesbonscomptes.com/recoll/usermanual/usermanual.html#RCL.INDEXING.MONITOR [lesbonscomptes.com]
More info here: http://www.lesbonscomptes.com/recoll/features.html [lesbonscomptes.com]
I might have to play with this.
(Score: 2) by sigma on Thursday March 05 2015, @01:15AM
I've posted suggestions above, but after reading this, I think you should look at Solr on the server and maybe Chrome POSTMAN for the clients.