
SoylentNews is people

posted by janrinok on Wednesday March 04 2015, @07:27PM
from the over-to-you dept.

What free software is there in the way of organizing lots of documents?

To be more precise, the ones I *need* to organize are the files on hard drives, though if I could include documents I have elsewhere (bookshelves and photocopy files) I wouldn't mind. They are text documents in a variety of file formats and languages, source code for current and obsolete systems, JPEG images, film clips, drawings, SVG files, object code, shared libraries, fragments of drafts of books, ragged software documentation, works in progress ...

Of course the files are already semi-organized in directories, but I haven't yet managed to find a suitable collection of directory names. Hierarchical classification isn't ideal -- there are files that fit in several categories, and there are a lot of files that have to be in a particular location because of the way they are used (executables in a bin directory, for example) or the way they are updated or maintained. Taxonomists would advise setting up a controlled vocabulary of tags and attaching tags to the various files. I'd end up with a triple store or some other database describing files.
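To make the triple-store idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration, not a recommendation of a particular tool: the table layout, the function names, and the opaque subject identifiers such as "doc-1" are all assumptions, and it deliberately leaves open how a file gets a stable identifier.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny triple store; the schema is a hypothetical one."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS triples (
               subject   TEXT NOT NULL,
               predicate TEXT NOT NULL,
               object    TEXT NOT NULL,
               UNIQUE (subject, predicate, object)
           )"""
    )
    return db


def add_triple(db, subject, predicate, obj):
    """Record one statement; repeating the same statement is harmless."""
    db.execute(
        "INSERT OR IGNORE INTO triples VALUES (?, ?, ?)",
        (subject, predicate, obj),
    )


def subjects_with(db, predicate, obj):
    """Return every subject carrying the given predicate/object pair."""
    rows = db.execute(
        "SELECT subject FROM triples "
        "WHERE predicate = ? AND object = ? ORDER BY subject",
        (predicate, obj),
    )
    return [row[0] for row in rows]
```

A file that fits several categories simply gets several "tag" triples, which is what a directory hierarchy cannot express directly.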


But how to identify the files being tagged? A file-system pathname isn't enough. Files get moved, and sometimes entire directory trees full of files get moved from one place to another for various pragmatic reasons. And a hashcode isn't enough. Files get edited, upgraded, recompiled, reformatted, converted from JIS code to UTF-8, and so forth. Images get cropped and colour-corrected. And under these changes they should keep their assigned classification tags.

Now a number of file formats can accommodate metadata. And some software that manipulates files can preserve metadata and even allow user editing of the metadata. But more of it doesn't.

Much of it could perhaps be done by automatic content analysis. Other material may require labour-intensive manual classification. Now I don't expect to see any off-the-shelf solution for all of this, but does anyone have ideas as to how to accomplish even some of this? Even poorly? Does anyone know of relevant practical tools? Or have ideas towards tools that *should* exist but currently don't? I'm ready to experiment.

 
  • (Score: 4, Informative) by frojack on Wednesday March 04 2015, @09:07PM

    by frojack (1554) on Wednesday March 04 2015, @09:07PM (#153242) Journal

    Agreed, for practical day-to-day use, tree-structured directories require a lot of knowledge about the structure to be useful. Even then, finding anything in a big directory is a problem. But at least with Linux he can link the same document into multiple directories without duplicating the files. That too takes a lot of knowledge.
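As a rough illustration of the linking approach, the sketch below hard-links one file into several category directories using Python's os.link. The helper name link_into and the directory names are made up for the example. Two known caveats: hard links only work within a single filesystem, and programs that save by writing a new file and renaming it will silently break the link.

```python
import os


def link_into(source, *directories):
    """Hard-link `source` into each directory, keeping its base name."""
    links = []
    for directory in directories:
        os.makedirs(directory, exist_ok=True)
        target = os.path.join(directory, os.path.basename(source))
        os.link(source, target)  # same inode, no second copy of the data
        links.append(target)
    return links
```

For example, link_into("notes.txt", "geometry", "reference") would make the same file reachable from both directories.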

    Better is a good indexing system, the topic of this post, and a request for tools.

    With a good tool, you should be able to search not only for file names, but also file types and, most importantly, content, plus any tags or metadata you may have added. Most of all, it can't rely on search-while-you-wait technology (find). It has to have an indexed database and be self-maintaining.

    With such a tool, a good one, you could resort to just using a document heap. (Not recommending that, but you maybe could).

    I too am interested in such a tool, but it's got to be self-managing, widely accessible, fast, and multiplatform too.

    ----------------
    I've found Baloo pretty good for this. (KDE4, OpenSuse). It is their new desktop indexing system to replace the prior iterations of Nepomuk which were horrible processor hogs. https://community.kde.org/Baloo [kde.org]

    It will index file name and CONTENT. So I just point it at every directory I want to be able to find stuff in, including my source code directories. Want to know every use of a particular function call? Type its name in the Dolphin search, and (because it's all fully indexed) your list of files is populated instantly.

    I use it mostly for content search. But I haven't tried it to see if it will index NFS mounts, and if so, at what network bandwidth price, and what happens when you dismount the NFS volume.

    --
    No, you are mistaken. I've always had this sig.
  • (Score: 2) by Immerman on Thursday March 05 2015, @02:24AM

    by Immerman (3985) on Thursday March 05 2015, @02:24AM (#153344)

    Agreed, a good search tool is a wonderful option. My own preference is the freeware tool Everything - on launch it nigh-instantly lists every file on the selected drives, and as fast as you can type it winnows the list to only those items whose name/path contains the specified word-fragments (alternatively it can employ regular expressions). Because it exploits the NTFS journal, index updates are almost instantaneous: they typically take only a few seconds, and most changes are auto-detected immediately. Building an index of hundreds of thousands of files from scratch takes under a minute, even on a slow HDD.

    Downsides are:
    - it can only search file names, but get into the habit of appending meta-tags to the end of the name and you can rest assured that you have a relatively future-proof and platform-agnostic "database": pretty much any search tool will work beautifully with file names
    - the NTFS journal exploitation only works on local NTFS drives, and seemingly only under Windows. For other file systems, or when running under WINE, you need to configure periodic index updates, or initiate them manually. (you also need admin rights to exploit the journal, though it can be configured to run as a service accessible to any user)

    It takes some tweaking to work decently under WINE [configure open files/folders on its context menu to run $exec(winebrowser "%1") ] , and still leaves much to be desired, but nothing else I've found is remotely in the same league. I'd love to find a comparable native alternative, but thus far I've come up empty handed.

    • (Score: 2) by frojack on Thursday March 05 2015, @04:50AM

      by frojack (1554) on Thursday March 05 2015, @04:50AM (#153393) Journal

      Probably not what the OP needs, since he is on linux.

      Often you don't have options of appending crap to the name. For instance, I have to index tons of source code. You don't get to change those names.
      Legal documents are another thing you really can't mess with.

      And a file name only search means you pretty much have to know the file name. Which is not likely going to be the case once you get beyond a few hundred thousand files.

      --
      No, you are mistaken. I've always had this sig.
      • (Score: 2) by Immerman on Thursday March 05 2015, @05:13AM

        by Immerman (3985) on Thursday March 05 2015, @05:13AM (#153406)

        I put the tweaks needed for WINE there for a reason - I mostly run Linux, and have pretty much given up finding a comparable native tool, and it's far too useful to abandon.

        As for file name limitations - yes, source code is a challenge, though I can't say I've seen any content-indexing systems that are worth much for source either, but fortunately it does tend to lend itself to hierarchical organization along module lines. For most everything else, I've begun abandoning short, condensed names in favor of actual descriptive ones that don't need to be memorized: "Trigonometry and Geometry quick-reference sheet v14.9.svg" will almost certainly be in the top ten results when I type "geo ref tri", and if not I'll just keep typing more requirements.

        One of the absolutely essential components of Everything is that I don't have to initiate a search - it shows me the matching results as fast as I type, starting with a list of all 200,000+ files on my computer when first launched. Even the fastest results after some "initiate search" trigger (hitting [Enter] or whatever) offer a qualitatively inferior experience.

      • (Score: 2) by hendrikboom on Thursday March 05 2015, @07:25PM

        by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @07:25PM (#153606) Homepage Journal

        Even though Everything is Windows-only, it's still worth discussing, because its salient features could perhaps be implemented in a Linux program. The OP *is* interested in tools that should exist but don't, after all.

        -- the OP

        • (Score: 2) by frojack on Thursday March 05 2015, @10:48PM

          by frojack (1554) on Thursday March 05 2015, @10:48PM (#153671) Journal

          The OP *is* interested in tools that should exist but don't, after all.

          The tools do exist. The OP just doesn't know what/where they are.

          --
          No, you are mistaken. I've always had this sig.
    • (Score: 2) by hendrikboom on Thursday March 05 2015, @06:52PM

      by hendrikboom (1125) Subscriber Badge on Thursday March 05 2015, @06:52PM (#153595) Homepage Journal

      Everything looks useful.

      It might not be impossible to port or rewrite for Linux.

      Except for the keystroke interaction, it seems to do the same as locate and updatedb. I wonder if it would be easy to write code that does the keystroke interaction to search locate's database.

      • (Score: 2) by Immerman on Thursday March 05 2015, @08:12PM

        by Immerman (3985) on Thursday March 05 2015, @08:12PM (#153632)

        I suspect it uses progressive winnowing - i.e. only searching the already reduced list as each new letter is added, which seems to run counter to Locate's interface at least - though I'll admit I never figured out how to get Locate to work as even a decent traditional search tool.
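Progressive winnowing, as guessed at above, can be sketched in a few lines of Python. This is not how Everything is actually implemented (that is not known here); the class name Winnower and its stack of earlier result lists are assumptions for illustration. Each extra keystroke filters only the survivors of the previous query, and backspacing pops back to an earlier list.

```python
class Winnower:
    """Narrow a list of paths as a query grows, reusing earlier results."""

    def __init__(self, paths):
        # The empty query matches everything, so it sits at the bottom.
        self._stack = [("", list(paths))]

    def update(self, query):
        query = query.lower()
        # Drop cached results whose query is not a prefix of the new one
        # (this is what handles backspace or a completely new query).
        while not query.startswith(self._stack[-1][0]):
            self._stack.pop()
        previous_query, candidates = self._stack[-1]
        if query != previous_query:
            terms = query.split()  # space-separated word fragments
            matches = [
                p for p in candidates
                if all(term in p.lower() for term in terms)
            ]
            self._stack.append((query, matches))
        return self._stack[-1][1]
```

Reusing the previous list is safe because any path that matches a longer query also matches every prefix of that query.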

        • (Score: 2) by hendrikboom on Thursday March 12 2015, @07:31PM

          by hendrikboom (1125) Subscriber Badge on Thursday March 12 2015, @07:31PM (#156849) Homepage Journal

          Another possibility is to use a mix of lazy and eager evaluation. When the first character is typed, all you need to find is the small number of entries that will actually fit on the screen. You can go on generating the rest of the list while waiting for the next character. When the second character is typed, you go through the list of remaining entries (which may only be partially generated) until you have a new screenful of surviving entries. Then while waiting for the third character, you go on generating this new list, switching gears when you get to the end of the second list and proceeding with the first, and when it expires, going on with the original index.
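The lazy half of this scheme can be sketched with Python generators. The names LazyList and refine are invented for the example, and the "keep generating while waiting for the next character" part is left out: this version only pulls as many index entries as are needed to fill the requested screenful, and each new query level reads the previous level's partially generated list before pulling anything further from the index.

```python
from itertools import islice


class LazyList:
    """Cache items pulled from an iterator so later readers can reuse them."""

    def __init__(self, source):
        self._source = iter(source)
        self._cache = []

    def __iter__(self):
        i = 0
        while True:
            if i < len(self._cache):
                yield self._cache[i]          # already generated earlier
            else:
                try:
                    item = next(self._source)  # generate one more entry
                except StopIteration:
                    return
                self._cache.append(item)
                yield item
            i += 1

    def take(self, n):
        """Return the first n entries, generating only what is missing."""
        return list(islice(self, n))


def refine(previous, query):
    """Build the next level: a lazy filter over the previous level."""
    q = query.lower()
    return LazyList(p for p in previous if q in p.lower())
```

A screenful for the second character is then refine(level1, "al").take(screen_rows), which touches the original index only if the first level's cached survivors run out.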

          And so on.

          • (Score: 2) by Immerman on Friday March 13 2015, @01:11AM

            by Immerman (3985) on Friday March 13 2015, @01:11AM (#157075)

            An excellent idea, but not I think one that it's using: the status bar immediately displays the number of matches, which wouldn't be possible with lazy evaluation. Though I would assume that the actual UI displayed list is generated on the fly - with as much data as is displayed for each file the comprehensive list would require far more RAM than is actually being used.

            Still, once I thought about it, if your average filename is, say, 20 characters, and a 1GHz processor can compare 1 billion characters per second, then that's 50 million files that could be scanned in a second: my paltry 150,000 files would take less than 1/300th of that. So a simple exhaustive search could potentially do the job quite tidily, especially if you assume path names are searched separately and then cross-referenced with the files they contain.
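The back-of-envelope estimate above can be written out as a small Python sketch. The figures (20 characters per name, one billion character comparisons per second) are the comment's own assumptions rather than measurements, and the function names are made up; real throughput depends on the matching algorithm, memory bandwidth, and case handling.

```python
def scan_time_seconds(n_files, avg_name_len=20, chars_per_second=1_000_000_000):
    """Estimated time to touch every character of every file name once."""
    return n_files * avg_name_len / chars_per_second


def exhaustive_search(names, fragment):
    """Brute-force scan: keep every name containing the fragment."""
    needle = fragment.lower()
    return [name for name in names if needle in name.lower()]
```

With those assumptions, scan_time_seconds(150_000) comes to about 0.003 seconds, i.e. roughly 1/333 of a second, which agrees with the "less than 1/300th" figure.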