SoylentNews is people

posted by LaminatorX on Monday October 13 2014, @07:56PM
from the The-past-is-never-dead.-It's-not-even-past. dept.

Of late, the volume of my Internet-based correspondence has been showing serious growth, and I've begun taking the preservation and archival of the products more seriously -- email conversations, message board threads, even some IRC discussions. I would like to use a unified system to archive these text-oriented communications. What solutions do Soylentils use or suggest? (I'm a Linux- and Vim-user, although discussion about systems or tools on any OS is welcome.)

Requirements:
-Browseable: A must! The value of easily revisiting past communications is immeasurable. .zip'd-type formats may save space, but even years of text-based personal correspondence don't amount to much in comparison to a few music albums or feature-length movies.
-Stored in a long-lasting format/encoding: ASCII won't work as Internet-based communication often contains structural elements like links and lists, not to mention RTF-style formatting. HTML seems like a good start.
-Maintains linear structure of discussion threads
-Searchable: Last and definitely least-necessary feature -- 'grep' is always an easy first resort :)
-Tag-able: If search features are built-in, this is an obviously valuable feature.

Related Stories

A Deep Dive into the History and Evolution of Zip Compression 20 comments

Hans Wennborg does a deep dive into the history and evolution of the Zip compression format and underlying algorithms in a blog post. While this lossless compression format became popular around three decades ago, it has its roots in the 1950s and 1970s. Notably, as a result of the "Arc Wars" of the 1980s, hitting BBS users hard, the Zip format was dedicated to the public domain from the start. The main work of the Zip format is performed through use of Lempel-Ziv compression (LZ77) and Huffman coding.

I have been curious about data compression and the Zip file format in particular for a long time. At some point I decided to address that by learning how it works and writing my own Zip program. The implementation turned into an exciting programming exercise; there is great pleasure to be had from creating a well oiled machine that takes data apart, jumbles its bits into a more efficient representation, and puts it all back together again. Hopefully it is interesting to read about too.

This article explains how the Zip file format and its compression scheme work in great detail: LZ77 compression, Huffman coding, Deflate and all. It tells some of the history, and provides a reasonably efficient example implementation written from scratch in C. The source code is available in hwzip-1.0.zip.

Previously:
Specially Crafted ZIP Files Used to Bypass Secure Email Gateways (2019)
Which Compression Format to Use for Archiving? (2019)
The Math Trick Behind MP3s, JPEGs, and Homer Simpson's Face (2019)
Ask Soylent: Internet-communication Archival System (2014)



This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2) by kaszz on Monday October 13 2014, @08:03PM

    by kaszz (4211) on Monday October 13 2014, @08:03PM (#105694) Journal

    Newer email conversations and message board threads are polluted with a lot of html tags. It might be an idea to at least reduce the amount. They may add little but use a lot of space and processing power.

    • (Score: 0) by Anonymous Coward on Monday October 13 2014, @08:11PM

      by Anonymous Coward on Monday October 13 2014, @08:11PM (#105695)

      I don't think that's worth the effort. Even a 300% increase in size on a text message is nothing when 2TB drives are well under $100.

      • (Score: 2) by kaszz on Monday October 13 2014, @08:14PM

        by kaszz (4211) on Monday October 13 2014, @08:14PM (#105696) Journal

        I think you underestimate the multiplier factor.

        • (Score: 2) by Freeman on Monday October 13 2014, @09:34PM

          by Freeman (732) on Monday October 13 2014, @09:34PM (#105728) Journal

          Assuming all that is saved is the text and an average forum page of 80kb, he would have to save in excess of 342 forum pages per day for 100 years to fill a 1 Terabyte Hard Drive. I'm thinking he's probably pretty good without scrubbing excess data.
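The figure above can be sanity-checked with a minimal sketch. It assumes decimal units (1 TB = 10^12 bytes, 80 kB = 80,000 bytes) and 365.25-day years, neither of which the comment states; the function name is made up for illustration.

```python
# Back-of-the-envelope check of the "342 pages per day" figure.
# Assumptions (not stated in the comment): decimal units, 365.25-day years.
PAGE_BYTES = 80 * 1000       # 80 kB average forum page
DRIVE_BYTES = 10 ** 12       # 1 TB drive
DAYS = 100 * 365.25          # one century


def pages_per_day(drive_bytes: int, page_bytes: int, days: float) -> float:
    """Pages that must be saved every day to fill the drive in `days` days."""
    return drive_bytes / page_bytes / days


if __name__ == "__main__":
    print(round(pages_per_day(DRIVE_BYTES, PAGE_BYTES, DAYS)))
```

Under those assumptions the result rounds to 342, so the arithmetic holds. Binary units would give a somewhat higher figure.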

          --
          Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
          • (Score: 2) by kaszz on Monday October 13 2014, @11:32PM

            by kaszz (4211) on Monday October 13 2014, @11:32PM (#105756) Journal

            Harddisks may not be the best long term solution.

            • (Score: 1) by pnkwarhall on Tuesday October 14 2014, @01:46AM

              by pnkwarhall (4558) on Tuesday October 14 2014, @01:46AM (#105792)

              So, optical disks then? There's no way I'm buying a tape storage machine ;)

              My reply is only slightly snarky. I think that, with irregular backups & storage-medium maintenance, I would have a good chance of having (most of) my data around for much of the time that it would be of interest to me. That being said, if I printed it out onto stacks of acid-free (or whatever) paper, and stored it in some air/fire/waterproof boxes somewhere relatively safe, I'm sure that it would have a much better chance for a longer shelf life, i.e. it sticking around and being read by someone in a future time. You know, like **books** (journals/manuscripts/scrolls/stone tablets/etc).

              The truth is that digital data is a pretty fragile storage medium. If I want to keep my communications around for the true long-term (late-life/posterity or whatever) I should take a serious look into hardcopy archival.

              It's not searchable though!

              --
              Lift Yr Skinny Fists Like Antennas to Heaven
              • (Score: 2) by kaszz on Tuesday October 14 2014, @03:45AM

                by kaszz (4211) on Tuesday October 14 2014, @03:45AM (#105814) Journal

                Just make a bit pattern like with QR codes. But for a whole A4 page. Then you can store like 500 kB per page.
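Whether 500 kB per page is plausible can be estimated with a rough sketch. Every parameter here is an assumption chosen for illustration (600 dpi printing, 10 mm margins, one bit per 2x2-dot cell, 30% of the bits spent on error correction); none of them come from the comment, and real 2D codes also spend space on alignment patterns.

```python
# Rough capacity estimate for a page-sized 2D bit pattern on A4 paper.
# All parameters are illustrative assumptions, not a real barcode spec.
A4_MM = (210, 297)
MM_PER_INCH = 25.4


def page_capacity_bytes(dpi: int, margin_mm: float,
                        dots_per_cell: int, ecc_overhead: float) -> int:
    """Usable bytes on one A4 page.

    dots_per_cell: side length, in printer dots, of the square cell
                   that encodes one bit.
    ecc_overhead:  fraction of the cells spent on error correction.
    """
    width_dots = int((A4_MM[0] - 2 * margin_mm) / MM_PER_INCH * dpi)
    height_dots = int((A4_MM[1] - 2 * margin_mm) / MM_PER_INCH * dpi)
    cells = (width_dots // dots_per_cell) * (height_dots // dots_per_cell)
    usable_bits = cells * (1 - ecc_overhead)
    return int(usable_bits // 8)


if __name__ == "__main__":
    print(page_capacity_bytes(dpi=600, margin_mm=10,
                              dots_per_cell=2, ecc_overhead=0.3))
```

With these guesses the estimate lands in the same several-hundred-kilobyte range as the 500 kB figure, so the claim is at least not unreasonable on paper. Scanning reliability is a separate question.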

          • (Score: 3, Insightful) by edIII on Monday October 13 2014, @11:47PM

            by edIII (791) on Monday October 13 2014, @11:47PM (#105760)

            His point about processing is spot-on though.

            Once you depart from straight ASCII into anything that needs to be parsed, you are spending processing cycles reading it first. This is an archival system. I would imagine we would want to limit searches to actual content, not HTML tags and structure. Think you might be surprised how many processing cycles you save by stripping out all the crap, especially foreign javascript components and whatever else on the page not worth archiving. When you only have less than 10k records it may not be much, but when you have considerably more than that (say hundreds of thousands), those inefficiencies add up and affect how well your search algorithms can perform.

            I created a system similar to this to store snapshots of a 3rd party website. Think screen scraping, but with credentials and automated BI. After a couple hundred thousand pages are stored, you would be surprised how much cruft there is in a page that needs to be processed when you retrieve it and try to work with it. Then multiply that by everyone accessing it from a server running reports against stored data, as well as searches for specific pages. Of course, we haven't even addressed the security issues of storing script and then displaying it again in a browser, unless this is a native application with a purpose-built, watered-down safe reader. There are a lot of reasons that styling and scripting are no longer valuable once the session is closed and the website's elements have evolved and changed. Keeping them is a waste of time in all respects.

            In general, if you can design the process to only store that which you need, you are far better off. With this specific case I absolutely would parse it down to content and create a database with all the metadata you want. Display later on can be based on your own CSS and themed. Do the work once, and then never again.
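Parsing a page down to its content can be sketched with nothing but Python's standard-library `html.parser`. This is a minimal illustration, not the system described above: the class and function names are invented, and it only drops tags plus the bodies of `<script>` and `<style>` elements.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def strip_to_content(html: str) -> str:
    """Return the text content of an HTML document, one chunk per line."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()   # flush any text still buffered by the parser
    return parser.text()
```

A real archiver would presumably also want to keep links and list structure, which is a matter of handling a few more tags in `handle_starttag`.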

            Other than that, yeah storage is cheap. It's not a good paradigm for scalability though.

            --
            Technically, lunchtime is at any moment. It's just a wave function.
            • (Score: 3, Insightful) by Freeman on Tuesday October 14 2014, @03:37PM

              by Freeman (732) on Tuesday October 14 2014, @03:37PM (#105962) Journal

              For a personal project, scrubbing the html and other junk may be overkill and a Huge waste of time. I would be doing good just to back it up in the first place. The more important thing would be to have backups of your backups. It's harder to accidentally destroy 2 sets of data in different locations than it is one set of data.

              --
              Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
              • (Score: 2) by edIII on Tuesday October 14 2014, @08:46PM

                by edIII (791) on Tuesday October 14 2014, @08:46PM (#106061)

                Perhaps, but your concern is addressed rather trivially. Just use a zero-knowledge service to back up your data to the "cloud". Most of them are using some pretty nice storage systems. Mine uses ZFS and all the data is contained in multiple pools. If you don't have the $60-$70 a year but do have hundreds of dollars in parts lying around, you can also build your own ZFS system. Worst case, go to Fry's and shell out a couple hundred on a nice NAS. Then there is the "all of the above" option that provides redundancy for a lot more than an archival system.

                Also, it's not like scrubbing it takes all that much effort. We are talking about a few lines of code and maybe a custom function or two. I think I wrote mine in about 20 minutes and had it tuned by lunch. The really *hard* stuff is already taken care of by other libraries and functions available in most platforms, especially scripting languages.

                What's also nice about spending the processing cycles, not making it a "Huge [sic] waste of time", is that you can more easily use web browsers as an interface to display the data. Otherwise you are injecting running scripts, or working with styling and elements hosted on foreign servers or stale CDN references most likely. Extra points for pulling down basic graphics and constructing regex replace statements to alter their src attributes.

                Leaving that stuff in affects storage, processing, and design considerations for your archival system. How many more reasons do you need to expend the resources putting together trivial code?

                Finally, the design restrictions in TFS state it must be browseable and maintain linear structure of discussion threads. With all that processing you need to do regardless, it makes little sense to spend actual resources transferring the cruft to the database and storage. You already have identified structure and content. Just save them and ditch the raw copy which requires base64 encoding if you want it to survive transport through various web APIs, Google among them.

                The cons of working with raw copy are pretty steep, while the pros of processing it get you tags, searchability, and structure. Pretty good argument for a small amount of work. Having been through this problem myself, I highly suggest some preprocessing and well thought out database designs.
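One possible shape for such a database, sketched with Python's built-in `sqlite3`. The table and column names are hypothetical and not taken from the system described above; the point is only that a `parent_id` column preserves thread structure and a separate tag table gives tagging.

```python
import sqlite3

# Hypothetical schema: one row per message, threaded via parent_id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS message (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES message(id),  -- NULL for a thread's first post
    source    TEXT NOT NULL,                   -- e.g. 'email', 'irc', 'forum'
    author    TEXT NOT NULL,
    sent_at   TEXT NOT NULL,                   -- ISO 8601 timestamp
    body      TEXT NOT NULL                    -- content with markup stripped
);
CREATE TABLE IF NOT EXISTS tag (
    message_id INTEGER NOT NULL REFERENCES message(id),
    name       TEXT NOT NULL,
    PRIMARY KEY (message_id, name)
);
"""


def open_archive(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_message(conn, source, author, sent_at, body, parent_id=None, tags=()):
    """Insert one message with its tags and return its id."""
    cur = conn.execute(
        "INSERT INTO message (parent_id, source, author, sent_at, body) "
        "VALUES (?, ?, ?, ?, ?)",
        (parent_id, source, author, sent_at, body),
    )
    conn.executemany(
        "INSERT INTO tag (message_id, name) VALUES (?, ?)",
        [(cur.lastrowid, t) for t in tags],
    )
    return cur.lastrowid


def thread(conn, root_id):
    """Return (id, author, body) for a whole thread, oldest first."""
    return conn.execute(
        """
        WITH RECURSIVE t(id) AS (
            SELECT ?
            UNION ALL
            SELECT m.id FROM message m JOIN t ON m.parent_id = t.id
        )
        SELECT m.id, m.author, m.body
        FROM message m JOIN t ON m.id = t.id
        ORDER BY m.sent_at
        """,
        (root_id,),
    ).fetchall()
```

Searching the bodies is then a plain `LIKE` query, or SQLite's full-text extension where it is available.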

                Ultimately you are right, it may not *have* to scale. I don't see it as a reason to design non-scalable software when the effort is truly small by experience.

                --
                Technically, lunchtime is at any moment. It's just a wave function.
          • (Score: 1) by hendrikboom on Tuesday October 14 2014, @07:07PM

            by hendrikboom (1125) Subscriber Badge on Tuesday October 14 2014, @07:07PM (#106038) Homepage Journal

            And WOW! Before those hundred years are up, he might even be able to get a sweet deal on a TWO-terabyte hard drive!

            Affordable disk capacity on hard drives seems to be increasing faster than the data I need to store.

            But not on tablets.

            -- hendrik

  • (Score: 2, Interesting) by Anonymous Coward on Monday October 13 2014, @08:16PM

    by Anonymous Coward on Monday October 13 2014, @08:16PM (#105697)

    Create a personal Wiki using a page for each date, almost like a journal or diary.
    You can run Dokuwiki and a small HTTP+PHP stack off a flash drive.

    Perhaps even using information/knowledge management software might help you. I dig TreeSheets.

    • (Score: 2) by SlimmPickens on Monday October 13 2014, @09:44PM

      by SlimmPickens (1056) on Monday October 13 2014, @09:44PM (#105732)

      That treesheets looks the goods

      The ultimate replacement for spreadsheets, mind mappers, outliners, PIMs, text editors and small databases.

      TreeSheets is exceptionally small & fast, so can sit in your system tray at all times: with several documents loaded representing the equivalent of almost 100 pages of text, it uses only 5MB of memory on Windows 7

  • (Score: 3, Interesting) by zafiro17 on Monday October 13 2014, @08:19PM

    by zafiro17 (234) on Monday October 13 2014, @08:19PM (#105699) Homepage

    I'm not sure why ASCII text has been ruled out in the summary. It seems to me that although OP wants to include things like RTF and HTML, a lot of what's going to get archived - IRC logs, email, etc. - are already plain text. Then, you can do some pretty amazing things with a Unix system and a folder full of text files - start with grep, and have fun.

    I can't imagine standardizing on anything other than text, and now that we're living in the land of unicorns, dark beer, and the Unicode standard, I'd convert everything to UTF-8 to make sure you don't have any annoying 'code page' errors and live happily ever after.
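A minimal sketch of that conversion step in Python. The fallback list is an assumption: it guesses Windows-1252 and then Latin-1 for anything that is not already valid UTF-8, which suits Western European mail but would mangle other legacy encodings, so adjust it to whatever your correspondents actually used.

```python
def to_utf8(raw: bytes, fallback_encodings=("cp1252", "latin-1")) -> bytes:
    """Re-encode bytes as UTF-8.

    Tries UTF-8 first (so already-converted files pass through untouched),
    then each legacy encoding in order. The order is a guess, not detection.
    """
    for encoding in ("utf-8", *fallback_encodings):
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # Only reached if no fallback could decode the input at all.
    return raw.decode("utf-8", errors="replace").encode("utf-8")
```

Because valid UTF-8 passes through unchanged, running this over the whole archive repeatedly is harmless.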

    --
    Dad always thought laughter was the best medicine, which I guess is why several of us died of tuberculosis - Jack Handey
    • (Score: 2) by kaszz on Monday October 13 2014, @08:27PM

      by kaszz (4211) on Monday October 13 2014, @08:27PM (#105701) Journal

      "I can't imagine standardizing on anything other than text"
      Ah!, you mean like binary log files.. a systemd speciality ;)

      Many people use html in email because they can't express themselves without icons and advanced formatting. So many emails currently contain html.

      • (Score: 2) by Appalbarry on Monday October 13 2014, @08:39PM

        by Appalbarry (66) on Monday October 13 2014, @08:39PM (#105708) Journal

        Many people use html in email because they can't express themselves without icons and advanced formatting.

        Arguably many people use various formatting options because there are times when it's the best way to communicate a specific idea to a specific recipient.

        Regardless of your comprehension levels, there are still times when the judicious use of bold, italic, or even a bulleted list, can make a message, or the critical parts of a message, more easily understandable.

        • (Score: 2) by hemocyanin on Monday October 13 2014, @08:55PM

          by hemocyanin (186) on Monday October 13 2014, @08:55PM (#105715) Journal

          You are totally correct, though I see you didn't justify :(

          That to the side however, there are really three issues with the question posed: storage, retrieval, and display.

          • Storage: best if the data is stored in a manner that has a decades long history of compatibility (like ASCII). Markup is great for formatting and happily, if the markup data is stored in ASCII format, it can be stored embedded in the content. If the markup is private, store a local copy of that if possible, or better yet, convert to a public ubiquitous form of markup.
          • Retrieval: ASCII text is easy to search and process.
          • Display: markup is a pain to look at. The data viewer must have sufficient capability to interpret the various forms of markup used. This is the hard part. Everything else is solved by ASCII.
          • (Score: 2) by VLM on Monday October 13 2014, @09:03PM

            by VLM (445) on Monday October 13 2014, @09:03PM (#105718)

            You guys are funny, but I get the feeling op is asking for what amounts to a personal dictionary or thesaurus not a training textbook.

            So... if you're trying to educate someone in the philosophy and history of unix, or train them on how to use the man command, then you need a graphic artist's work on typography. If on the other hand you're just trying to provide a cheatsheet that man page sec 2 is system calls and 3 is std library calls (or maybe I got it backward yet again, it doesn't really matter anyway...) then effort spent turning it into a crafting project is pretty much wasted.

        • (Score: 2) by kaszz on Monday October 13 2014, @11:20PM

          by kaszz (4211) on Monday October 13 2014, @11:20PM (#105752) Journal

          Yes *bold* /italic/ or
            * Even
            * A
            * Bulleted list

          Can have its uses.. :p

          • (Score: 1) by pnkwarhall on Tuesday October 14 2014, @01:20AM

            by pnkwarhall (4558) on Tuesday October 14 2014, @01:20AM (#105785)

            This is many times how I write **to myself**. But there's no way I'm going to spend time converting messages from HTML or RTF to ASCII-w/-personal-markup just for a unified storage solution.

            --
            Lift Yr Skinny Fists Like Antennas to Heaven
            • (Score: 2) by kaszz on Tuesday October 14 2014, @03:43AM

              by kaszz (4211) on Tuesday October 14 2014, @03:43AM (#105812) Journal

              One uses automation to do this. The name of the game for archives and "big data" is algorithms and automation.

      • (Score: 2) by hemocyanin on Monday October 13 2014, @08:41PM

        by hemocyanin (186) on Monday October 13 2014, @08:41PM (#105709) Journal

        Many people use html in email because they can't express themselves without icons and advanced formatting. So many emails currently contain html.

        So? There is no yellow smiley face that shows up when I type :-) UNLESS the interpreter substitutes those ASCII characters for a binary image of a smiley face. The smiley face still exists of course as a textual element -- maybe those three characters but it could also be some type of markup ***smiley***, [smiley], , etc. etc. In other words, nothing is lost even if you are archiving the content of someone who can't communicate without ;-) -- indeed, perhaps there is nothing worth storing in that case.

      • (Score: 2) by frojack on Monday October 13 2014, @08:54PM

        by frojack (1554) on Monday October 13 2014, @08:54PM (#105714) Journal

        True, but with any proper indexing functions the html doesn't always (or often) get in the way of a search for content.
        It might be a problem if your correspondent insists on tossing a boatload of crap formatting into the words you want to search, but that seems unlikely.

        After many years of simply sucking up way too many resources to be practical, the text indexing that is currently found in KDE4 (baloo indexer and friends) is now stable and lightweight enough that I can point it at all my source code and even kernel source trees to find stuff very fast. I couldn't possibly grep this stuff, since it is scattered over a variety of directories, and grepping that much stuff consumes an inordinate amount of disk IO.

        --
        No, you are mistaken. I've always had this sig.
      • (Score: 2) by LoRdTAW on Monday October 13 2014, @09:38PM

        by LoRdTAW (3755) on Monday October 13 2014, @09:38PM (#105730) Journal

        How else am I supposed to send email with a pink glitter background?

        • (Score: 2) by VLM on Monday October 13 2014, @09:52PM

          by VLM (445) on Monday October 13 2014, @09:52PM (#105735)

          Silly LoRdTAW, everybody knows you use Elmer's glue, just like you correct word processing documents by putting whiteout on the screen.

        • (Score: 1) by pnkwarhall on Tuesday October 14 2014, @01:48AM

          by pnkwarhall (4558) on Tuesday October 14 2014, @01:48AM (#105793)

          lol, you can send it with pink glitter, but I ain't savin' it that way!

          --
          Lift Yr Skinny Fists Like Antennas to Heaven
          • (Score: 2) by LoRdTAW on Tuesday October 14 2014, @12:38PM

            by LoRdTAW (3755) on Tuesday October 14 2014, @12:38PM (#105905) Journal

            We actually had a secretary use a glittery orange background in her emails when she first started. It was a visual assault on the eyes, a crime worthy of punishment. I told her to remove the background and she changed it to pink glitter instead. Thankfully she sent the boss an email and he marched straight in and made sure she did not use a background.

  • (Score: 5, Funny) by zafiro17 on Monday October 13 2014, @08:26PM

    by zafiro17 (234) on Monday October 13 2014, @08:26PM (#105700) Homepage

    Had another idea (no charge for you):

    Take all that information, copy and paste it into one Microsoft Word document. It's going to get pretty long, but you can break apart the sections visually by including some clip-art, or you can use section headers using "Word Art" that makes it easy to see. Just try to remember the right combination of font and bold/italic/font size/etc. for each header so you can keep it the same if possible. Highlight parts that you think are important.

    That document is going to get pretty big pretty fast. You can keep the file size down by using a smaller font. Comic sans uses less disk space because the letters are smaller. Now, to back it up and keep all that information safe: every now and then email it to yourself from your Hotmail account. Don't bother zipping it - sometimes zip files get mistaken for viruses. That way you'll have two copies of it - the old version in your "sent messages" and the new, fresh one in your inbox.

    Good luck! If you have any more questions, come ask them at our new Yahoo Group: "internetexperts." We are the roxxers.

    --
    Dad always thought laughter was the best medicine, which I guess is why several of us died of tuberculosis - Jack Handey
    • (Score: 0) by Anonymous Coward on Monday October 13 2014, @09:32PM

      by Anonymous Coward on Monday October 13 2014, @09:32PM (#105726)

      emacs

      • (Score: 2) by VLM on Monday October 13 2014, @09:49PM

        by VLM (445) on Monday October 13 2014, @09:49PM (#105733)

        "emacs"

        Yeah OK AC, the truly hard core are going to hoard in mysql with binary blobs and do the whole CRUD stack (well not so much D) right on the command line

        It is of vital importance agent VLM that you determine what you told that chick you were breaking up with back in '95 when you were drunk, why, well, I donno.

        "SELECT content FROM hoard WHERE YEAR(`when`)=1995 AND who="That hot red head coworker I dated before I met my wife" AND type="email" AND (mental_state="drunk_after_breakup" OR mental_state="drunk" OR mental_state="post_breakup") AND (content LIKE '%apology%' OR content LIKE '%kissing that other chick%') ORDER BY `when` DESC LIMIT 50\G"

        And indexes are for wusses just throw hardware at it if you don't get a result quick enough it probably doesn't matter anyway.

        • (Score: 2) by DECbot on Tuesday October 14 2014, @01:51PM

          by DECbot (832) on Tuesday October 14 2014, @01:51PM (#105926) Journal

          You know, I was planning to suggest a database. But after seeing your post, even with a php or python frontend, it'd be overkill. Just upload it all to google, let them sort it out.

          --
          cats~$ sudo chown -R us /home/base
  • (Score: 2) by hemocyanin on Monday October 13 2014, @08:36PM

    by hemocyanin (186) on Monday October 13 2014, @08:36PM (#105704) Journal

    Maybe I'm just getting pedantic about the "stored in a long-lasting format/encoding" line -- but it seems to me that HTML _is_ ASCII. Web content (aside from binary parts like photos, images, videos, or sounds) is just ASCII text embellished with markup that your browser displays by certain rules for that markup. Decoding the markup is the job of your browser, not something your storage system needs to care about. Even word processing documents, at least those that aren't stored in binary, are just plain text with markup that the editor (word processor) interprets to generate printable results. Unzip a docx or an odt and you will find a plain text file you can manipulate (or generate) directly.

    Plain text files are the most useful and resilient digital things out there. I can't read the files I wrote in grad school in the mid 90s on a word processor whose name I can't recall, but anything that is in plain text is just fine. Store as much as you can in plain text and leave the display issues to your data viewer.
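The docx/odt point is easy to try for yourself. Both formats are ordinary zip containers; the main body lives in `word/document.xml` for .docx and in `content.xml` for .odt, and what you find there is XML text, usually UTF-8. Below is a minimal sketch using Python's `zipfile`; the function names are made up, and `fake_odt` only builds a stand-in archive in memory so the example runs without a real document.

```python
import io
import zipfile

# Main body member inside each container; both hold XML text.
BODY_MEMBER = {".docx": "word/document.xml", ".odt": "content.xml"}


def read_body_xml(fileobj, extension):
    """Return the main XML member of a .docx or .odt container as text.

    fileobj may be a path or an open binary file. UTF-8 is assumed.
    """
    with zipfile.ZipFile(fileobj) as archive:
        return archive.read(BODY_MEMBER[extension]).decode("utf-8")


def fake_odt(xml_text):
    """Build a stand-in .odt in memory (hypothetical helper, demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", xml_text)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(read_body_xml(fake_odt("<text>Dear diary</text>"), ".odt"))
```

Once the XML is out, it is just text, so grep and friends work on it like on anything else.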

  • (Score: 3, Funny) by c0lo on Monday October 13 2014, @08:38PM

    by c0lo (156) Subscriber Badge on Monday October 13 2014, @08:38PM (#105707) Journal
    I hear that NSA has the same problems as yours and they solved them somehow.
    Maybe you can negotiate a reasonable fee to allow you to access their service?
    --
    https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
  • (Score: 3, Informative) by black6host on Monday October 13 2014, @08:45PM

    by black6host (3827) on Monday October 13 2014, @08:45PM (#105710) Journal

    If you're on Windows (or Mac, don't know if either has a linux version or can be made to run under linux) you might consider OneNote or Evernote. They both allow you to categorize the info you're collecting, will handle HTML and are quite handy. I started with OneNote but moved to Evernote as it just met my needs better with its android version. (There is a version of both, I believe, for the Mac and iPhones).

    There are a lot of formatting options in either one but I don't really use them that much. I just want to keep track of things like receipts for purchases and when bills are paid. I also believe that either one will let you clip from your browser straight into the program. At least I know it works with Firefox and the Windows version of Evernote.

    You can embed sound, pics, movies, spreadsheets, docs, etc. Evernote does have a free version but if you want some of the extra features, like keeping the full data set on your phone instead of having to download the data you want each time you access it then Evernote has a premium annual subscription option that provides that and other features.

    Either one, though, is very good at keeping all kinds of disparate information together and organized.

    • (Score: 2) by frojack on Monday October 13 2014, @09:19PM

      by frojack (1554) on Monday October 13 2014, @09:19PM (#105721) Journal

      Plus One for Evernote (nixnote on linux) but it's not up to this task on the scale the OP seems to want.

      First, the sheer volume of history would be expensive to store there, and it would be difficult to arrange transfer of every single thing he seems to want to include.
      Plus there is this whole issue of it being off site, on a subpoena-able service.

      Personally, I don't want to store all that crap myself.
      Mailing list archives are already on line for the most part and Google can find the content way faster than any Grep. Private lists are a different matter.
      If it is public stuff (mailing lists, forums, etc) I just hand the job to Google.

      If it's private stuff like email or proprietary code, I rely on other search tools over the years. Currently favoring the so-called "semantic desktop" type of search capabilities of KDE4. https://community.kde.org/Baloo [kde.org]

      This scans and builds indexes and allows tagging, etc. But the key thing is it allows me to leave the source in whatever form I acquired it: email, source code, PDFs, Word documents, etc. It will literally index it by every word, and keep the indexes up to date, and lets me add tags and notes without touching the original documents.

      By using indexing systems that can index a wide variety of sources without me having to put the documents into the indexer, I am free to move to any future solutions that come along. (There have been several over the years). As long as I have the source I can index it.
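The idea of indexing files where they already live can be shown with a toy inverted index. To be clear, this is not how Baloo is implemented; it is a minimal sketch with invented function names that reads every file under a directory as text, maps each word to the files containing it, and never modifies the originals.

```python
import re
from collections import defaultdict
from pathlib import Path

WORD = re.compile(r"\w+")


def build_index(root):
    """Map each lowercased word to the set of files under root containing it."""
    index = defaultdict(set)
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Undecodable bytes are replaced rather than aborting the scan.
        text = path.read_text(encoding="utf-8", errors="replace")
        for word in WORD.findall(text.lower()):
            index[word].add(path)
    return index


def search(index, *words):
    """Return the files that contain every one of the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()
```

Since the index is derived entirely from the source files, it can be thrown away and rebuilt by any later tool, which is the portability argument made above.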

      All the other stuff I let Google index. I'm past the point of caring about communications that happened 20 years ago, and as soon as the OP gets over himself, he will probably realize the folly involved in indexing every word he wrote.

      --
      No, you are mistaken. I've always had this sig.
    • (Score: 2) by VLM on Monday October 13 2014, @09:24PM

      by VLM (445) on Monday October 13 2014, @09:24PM (#105722)

      don't know if either has a linux version or can be made to run under linux

      Evernote works great in Chromium. I've never used it other than in Chrome on a Linux desktop or the android app on phone and tablet. Soon I'll be trying it on Chrome (or whatever) on a freebsd desktop.

      The web page sucks. You can tell they're solely interested in growth and getting bought for Billions because it's very easy to find the sign up links and very hard to find the log in to your account links. This truth holds across all sites on the internet. In the long run, when they're finally bought by Microsoft and merged with Minecraft, or bought by Amazon and integrated into the wishlist with patented one-click ordering, I'll have to go back to org mode. But I like how Evernote embeds PDFs and graphics files into notes (last time I tried org mode, it didn't), and the mobile app is pretty awesome, although emacs-ish. So yeah, their marketing is all growth hacking in buyout mode or whatever, which sucks, because it's a cool service.

      By emacs-ish I mean this: it's widely known that all line noise is almost certainly a valid emacs keyboard command, and the Evernote mobile app has so many features and pull outs and pull downs and pull ups and pull it all around that any and all caffeine inspired twitch is almost certainly bound to some function or action, which is a little weird.

      Last time I tried mobile org mode it was version 1.0 or maybe 0.1 a very long time ago and it wasn't so smooth. I would be interested in a report from this decade. Especially since in the long run I'll probably have to move from evernote after they get bought.

    • (Score: 2) by Geotti on Monday October 13 2014, @11:50PM

      by Geotti (1146) on Monday October 13 2014, @11:50PM (#105761) Journal

      DevonThink is another option for mac users. It's pretty neat. http://www.devontechnologies.com [devontechnologies.com]

  • (Score: 3, Insightful) by Anonymous Coward on Monday October 13 2014, @08:47PM

    by Anonymous Coward on Monday October 13 2014, @08:47PM (#105711)

    Here is an idea. Delete the conversations. I have years of this stuff stored here and there. I rarely look at it. I never search through it. I just use a search engine for any information I am currently looking for. Usually, when I come across something archived like that from, say, 15 years ago, I blow it away. I do not have to worry about it.

    The *only* time it is worth keeping something is if someone is unreliable and I need to call them out on not doing something they said they would.

    Don't become a digital packrat. Take it from someone who is (about 2.3 TB, down from 3.4). It is a headache. 10 years from now you will look at it and think 'why did I keep this conversation on the intricacies of insert cool movie here'. It will not matter.

    • (Score: 0) by Anonymous Coward on Monday October 13 2014, @09:29PM

      by Anonymous Coward on Monday October 13 2014, @09:29PM (#105724)

      >10 years from now you will look at it and think 'why did I keep this conversation on the intricacies of insert cool movie here'.

      This may not seem like helpful advice since it's not what OP asked for, but it's actually the best advice in the thread. Young people don't understand that there's no value in saving ancient conversations other than laughing at how clueless you were at the time.

      • (Score: 2) by Lagg on Monday October 13 2014, @09:53PM

        by Lagg (105) on Monday October 13 2014, @09:53PM (#105736) Homepage Journal

        Yeah, it's already bad enough to look at old code you wrote. Keeping actual conversations is just silly, and less useful to boot.

        --
        http://lagg.me [lagg.me] 🗿
        • (Score: 2) by hemocyanin on Tuesday October 14 2014, @12:12AM

          by hemocyanin (186) on Tuesday October 14 2014, @12:12AM (#105774) Journal

          I will sort of disagree, although I totally agree with your sentiment and I do not have any sort of archive of all the letters I sent to girlfriends, family, or friends in HS and College (this was back in the 70s and 80s when email was limited to the very few), much less an archive of emails, texts, forum posts, etc. The thing is though, our understanding of history is enhanced by those few people who saved every letter they sent/received or stored massive archives of accounting records on paper or clay tablets or whatever. Who knows -- if this guy saves a complete archive in a means that is future proof, it might end up being a useful trove of information 1000 years from now (although I would think the chance of it being accessible, even if all stored in plain ASCII, to be extremely remote).

      • (Score: 2) by Tork on Tuesday October 14 2014, @04:45AM

        by Tork (3914) Subscriber Badge on Tuesday October 14 2014, @04:45AM (#105828)
        I'm an old person who followed your advice and now regrets it. A couple of my IRC friends have died and now I wish I could re-read our chats.
        --
        🏳️‍🌈 Proud Ally 🏳️‍🌈
    • (Score: 2) by frojack on Monday October 13 2014, @09:57PM

      by frojack (1554) on Monday October 13 2014, @09:57PM (#105737) Journal

      Pretty much spot on.

      If you think you'll want this boatload of crap to will to your offspring, or to write your memoirs, I can assure you that the embarrassment of looking at your early writings will discourage you more than the sheer work involved. If you are thinking of running for office, it's too frigging late already to start cleaning up your bread crumbs.

      Get over yourself, and just let it go.

      If someone challenges you about what you may have said or may have written 10 years ago, just say you don't remember, and it was probably in jest or drunkenness anyway, and move on. If you want it to challenge someone else on what they might have said 12 years ago, well... Don't be that guy.

      I can see this being useful for software development, lawyers, doctors, researchers, and patent trolls, but personal internet communication, email, and IRC? Really?

      --
      No, you are mistaken. I've always had this sig.
      • (Score: 1) by pnkwarhall on Tuesday October 14 2014, @02:43AM

        by pnkwarhall (4558) on Tuesday October 14 2014, @02:43AM (#105801)

        Your response was the most useful to me, particularly because the tone/attitude it was written in made me take a second look at "what" and "why" I want to save.

        For the "what", I only truly care about saving two types of writing: correspondence w/ close friends and family, and serious writing about the things I'm passionate about. Archival of the latter is easy, because it's generally formatted nicely for display in basic HTML, and I can simply print it out for truly long-term archival (& save the page in a digital collection). For personal correspondence, which I want to save for mostly nostalgic reasons, it would be a similarly simple task to just print them out and keep them in a binder or something...

        As for the "why", the (personal) reason is simple: I have a box in my closet. It's filled with my writings and correspondence from the years before I got online and started doing the vast majority of my writing on a computer. Sure, like someone said above, there's a bunch of personal writings in there whose primary value is to make me look at myself and appreciate how much I've grown from the young idiot I was. But there's also a bunch of stuff, personal and from family/friends, that I'm glad I saved, and that when read every few years reminds me of who I am, what I've done/created, and who I've loved. I don't want to lose that treasure trove as I get older just because I don't make hardcopies anymore... And maybe that's the simplest solution. As I like to tell people (I don't really take pictures myself) -- "If you really like a photograph, print it out. It'll last a lot longer and you'll appreciate it more often."

        I should follow my own advice with the writings I truly value. Thank you for your comment.

        --
        Lift Yr Skinny Fists Like Antennas to Heaven
        • (Score: 2) by frojack on Tuesday October 14 2014, @04:08AM

          by frojack (1554) on Tuesday October 14 2014, @04:08AM (#105816) Journal

          Glad you took it as a positive comment. After I posted it, I realized it sounded a little blunt and uncaring. (It's a personal failing of mine).

          Maybe you should get a good All-in-one printer-scanner and scan the printed stuff that is really important to PDFs.
          Then use the printer and print the important digital stuff to some diary form for easier reading. Belt and suspenders.

          The best (affordable) scanning platform I've found is just about any of the above mentioned scanners, and XSane software on Linux. It can even do enough
          text recognition to make printed documents searchable. But with hand written stuff you have to manually add text notes because nobody does handwriting
          reco worth squat. I've been working on recovering an organization's paper archives into digital form.
          With XSane I've pulled spilled coffee stains out of the only remaining copies of historical documents that have been with the organization since its inception.

          All the huge piles of Windows scanning software is crap. I've tried lots of them.
          There are also smartphone apps that do an amazingly good job of scanning documents to pdf, you can copy an entire sheaf of letters into PDF form in mere minutes.

          But don't be afraid to toss stuff. Digital or Paper. It's amazing the detritus people collect.

          --
          No, you are mistaken. I've always had this sig.
    • (Score: 0) by Anonymous Coward on Monday October 13 2014, @10:01PM

      by Anonymous Coward on Monday October 13 2014, @10:01PM (#105738)

      Your business contacts are keeping copies. Would you like to be in court without records of your own?

      I wouldn't waste much time processing the data, but some preparation for future search and retrieval, followed by archiving is what I would do.

    • (Score: 2) by LoRdTAW on Monday October 13 2014, @11:04PM

      by LoRdTAW (3755) on Monday October 13 2014, @11:04PM (#105750) Journal

      There are a lot of data pack rats, me being one. Though, here is one story that made pack-ratting worthwhile:

      A while ago my brother was working on a hex-tile, turn-based strategy game engine. He wrote a map editing tool in C# and did the engine in C++/SDL/OpenGL. He made decent progress, to the point where he had all the rendering stuff working and the map editing tool working. He then got a job and put the project on the back burner. Fast forward a few years and I inherited the development laptop he used to build the game. I backed everything up, nuked Windows and installed Debian. About three months ago at work he asked me for that code, as he was rekindling the idea of the game with a fellow programmer friend. At first he thought it was a long shot asking for that code; he figured I deleted it or tossed the laptop. I told him I would have a gzip file with his game ready in a few minutes. All I heard was "Fuck yes! Thank you!" He was very happy to find that his 4 or 5 year old code was still kicking around. Don't know if he still plans to continue the game, but he and his friend went over the code and got some more ideas.

      • (Score: 1, Interesting) by Anonymous Coward on Tuesday October 14 2014, @01:29AM

        by Anonymous Coward on Tuesday October 14 2014, @01:29AM (#105788)

        I'll see your brother story and raise you one--

        My brother wrote some custom optimization code back in the 1980s for a Fortune 500 company. A typical batch run took 2 hours on a "mainframe" and that was after many painful hours of profiling and other speedup coding. The customer liked it, but around ten years ago they got out of that business completely. We never gave them an exclusive on the source, so we still own it. I saved all the C source and support files, migrating from 8" floppy...to now. My brother died in 2001 after a long illness.

        Now, 30 years later we have an intern who is interested in the same problem...and that old source runs in seconds or minutes these days. Our intern is going to use the old source for the core compute engine behind a phone app. It won't be a million $$ app, but might get us some good press in our niche industry.

  • (Score: 2) by LoRdTAW on Monday October 13 2014, @09:04PM

    by LoRdTAW (3755) on Monday October 13 2014, @09:04PM (#105719) Journal

    You could install one of those humongous document managers. But why deal with a ton of bloat and dependencies? You can roll your own quite easily as long as you just want to store text and only text! Do you really want a database engine, web server, php backend etc. just to find an IRC chat log?

    -Browseable: A must! The value of easily revisiting past communications is immeasurable. .zip'd-type formats may save space, but even years of text-based personal correspondence don't amount to much in comparison to a few music albums or feature-length movies.

    These two don't go together. Save the compression for backups and long term archival. Searching an archive is pretty CPU and disk intensive. That or you build a table in a file that stores the metadata and tags for each compressed file in the archive. Generate that file before archival/compression and include it inside the file. Make sure it has the same name in every archive for simplicity, e.g. content.txt. To search an archive, uncompress just that file, grep it for the tags and then untar the corresponding file(s) you want in the archive. Not impossible but very difficult to work with. A script could take care of the whole mess.
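    The content.txt idea above could be sketched roughly like this in Python. This is only an illustration under assumptions: the tab-separated index layout, the function names, and the tags-per-file dictionary are all made up for the example, and the only thing taken from the comment is the fixed index name. The index is written as the first member so a reader can stop as soon as it has been found.

```python
import io
import tarfile
from pathlib import Path

INDEX_NAME = "content.txt"  # same name in every archive, as suggested above


def build_archive(archive_path, files_with_tags):
    """Write a .tar.gz whose first member is a tag index.

    files_with_tags is a hypothetical input: {file path: [tag, ...]}.
    Index lines look like "<file name>\t<tag1>,<tag2>" (assumed layout).
    Files are stored under their base name, so base names must be unique.
    """
    lines = [f"{Path(p).name}\t{','.join(tags)}" for p, tags in files_with_tags.items()]
    index_bytes = ("\n".join(lines) + "\n").encode("utf-8")
    with tarfile.open(archive_path, "w:gz") as tar:
        info = tarfile.TarInfo(INDEX_NAME)
        info.size = len(index_bytes)
        tar.addfile(info, io.BytesIO(index_bytes))  # index goes in first
        for p in files_with_tags:
            tar.add(p, arcname=Path(p).name)


def read_index(archive_path):
    """Return the index text, stopping at the first matching member."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.name == INDEX_NAME:
                return tar.extractfile(member).read().decode("utf-8")
    raise KeyError(f"{INDEX_NAME} not found in {archive_path}")


def search_archive(archive_path, tag):
    """List the file names in the archive that carry the given tag."""
    hits = []
    for line in read_index(archive_path).splitlines():
        name, _, tags = line.partition("\t")
        if tag in tags.split(","):
            hits.append(name)
    return hits


def extract_member(archive_path, name):
    """Pull one file's text out of the archive without unpacking the rest to disk."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.extractfile(name).read().decode("utf-8")
```

    Typical use would be search_archive("2014.tar.gz", "irc") followed by extract_member() on whichever names come back. It works, but it is still more moving parts than a plain directory tree.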

    -Stored in a long-lasting format/encoding: ASCII won't work as Internet-based communication often contains structural elements like links and lists, not to mention RTF-style formatting. HTML seems like a good start.

    HTML might be good but perhaps you might want to consider Markdown: https://en.wikipedia.org/wiki/Markdown [wikipedia.org]. It keeps in line with the simplicity of ASCII but does away with the verbose markup of HTML. You could couple that with Werc for web display if you are looking for simplicity and want to avoid bulky frameworks: http://werc.cat-v.org/ [cat-v.org]. Markdown is also more legible if you need to view the raw file. You can even go the other way too. Just search Google for "convert html to markdown" and "convert rtf to markdown". Oh and there is a Vim plugin too :). K.I.S.S.! And avoid PostScript and PDF, but I bet you already knew that.

    -Maintains linear structure of discussion threads

    You want the kitchen sink too? Just kidding. This needs to come from your source material. If the source has formatting then it can be preserved.

    -Searchable: Last and definitely least-necessary feature -- 'grep' is always an easy first resort :)

    Well you are a Linux/Vim person so grep and Unix tools are your friends. Using the simplicity of directories to sort and store files, you can also use your file names to store "meta data" and make them grepable. KISS applies here. Or put your tags and meta data in the document header. So a grep or sed/awk script only has to read the first few dozen/hundred bytes or so to find your content. Build a directory structure and simply add content by date. E.g. archive this soylent conversation as ~/webtalk/2014/10/13/ask_SN_about_document_storage.txt. The idea is to remove the need for complex software stacks and use built in tools that will still probably work 50 years from now.
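    If you would rather script it than type the paths by hand, the dated directory layout plus a small header could look something like the Python sketch below. The "Title:" and "Tags:" header lines, the 512-byte read limit, and the function names are assumptions for illustration; the only part taken from the text above is the root/YYYY/MM/DD/name.txt layout and the idea of reading just the start of each file.

```python
import re
from pathlib import Path

HEADER_BYTES = 512  # assumed upper bound on header size; only this much is read


def dated_path(root, date, title):
    """Build root/YYYY/MM/DD/<title_with_underscores>.txt for a date object."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return (Path(root) / f"{date.year:04d}" / f"{date.month:02d}"
            / f"{date.day:02d}" / f"{slug}.txt")


def save_document(root, date, title, tags, body):
    """Write the text with a small header; a blank line ends the header."""
    path = dated_path(root, date, title)
    path.parent.mkdir(parents=True, exist_ok=True)
    header = f"Title: {title}\nTags: {', '.join(tags)}\n\n"
    path.write_text(header + body, encoding="utf-8")
    return path


def find_by_tag(root, tag):
    """Return files whose header carries the tag, reading only the file start."""
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        with open(path, "rb") as fh:
            head = fh.read(HEADER_BYTES).decode("utf-8", errors="replace")
        for line in head.splitlines():
            if line == "":  # end of header, stop before the body
                break
            if line.startswith("Tags:"):
                tags = [t.strip() for t in line[len("Tags:"):].split(",")]
                if tag in tags:
                    hits.append(path)
                break
    return hits
```

    Calling save_document("webtalk", datetime.date(2014, 10, 13), "ask SN about document storage", ["soylent"], text) would land the file at webtalk/2014/10/13/ask_SN_about_document_storage.txt, and everything stays plain text, so grep keeps working on the same tree if the script ever rots.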

    -Tag-able: If search features are built-in, this is an obviously valuable feature.

    Again, store tags in the file names or file header. Don't be shy about long file names if it helps. Though I can see some people cringing at the idea of very long file names. It is up to you. Might not be so bad to use the filename to store relevant data. Might also make archive searching easier as you simply list the file names in the archive and grep 'em.
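    For the tags-in-file-names route, here is a minimal Python sketch. The naming convention (title, then a double underscore, then tags joined with "+") is invented for this example and is not any standard; pick whatever separator you will still recognize in a few decades.

```python
import tarfile


def tagged_name(title, tags):
    """Encode tags into a file name: <title>__<tag1>+<tag2>.txt (assumed convention)."""
    return f"{title}__{'+'.join(sorted(tags))}.txt"


def tags_of(name):
    """Recover the tag set from a name produced by tagged_name()."""
    stem = name.rsplit("/", 1)[-1]  # ignore any directory part
    if stem.endswith(".txt"):
        stem = stem[: -len(".txt")]
    if "__" not in stem:
        return set()
    return set(stem.rsplit("__", 1)[1].split("+"))


def names_with_tag(names, tag):
    """Filter a list of file names down to those carrying the tag."""
    return [n for n in names if tag in tags_of(n)]


def archive_names_with_tag(archive_path, tag):
    """List member names of a tar archive and keep those with the tag."""
    with tarfile.open(archive_path) as tar:
        return names_with_tag(tar.getnames(), tag)
```

    Listing names never touches file contents, which is the whole appeal; the shell equivalent is just a tar listing piped through grep.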

    K.I.S.S. was my father's favorite expression along with "just do it", even before Nike usurped it in the late 80's. Like I said before, you want to be sure this system works 5, 10, 15, 30 or even 60 years from now. Using directories and Unix tools should ensure you are still able to grep your directory tree full of markdown files for decades down the road.

    • (Score: 1) by hendrikboom on Tuesday October 14 2014, @06:57PM

      by hendrikboom (1125) Subscriber Badge on Tuesday October 14 2014, @06:57PM (#106036) Homepage Journal

      The trouble with markdown is that there are a lot of minor variants of it. While the documents will be mostly readable in their ASCII form, there'll be the problem of knowing which markdown processor belongs to which document.

      -- hendrik

  • (Score: 2) by VLM on Monday October 13 2014, @09:13PM

    by VLM (445) on Monday October 13 2014, @09:13PM (#105720)

    First google for "getting things done" and think about implementing it.

    Then look at emacs org mode or maybe Evernote and do the obvious.

    Finally look at it from a hoarding perspective: the time and effort you spend collecting junk must be well under the time and effort required to work around it disappearing. So I've forgotten more about Ebers-Moll and Early effect transistor modeling and hybrid pi vs h parameters than you'll likely ever learn, and that's OK, because most of the time I don't need that anyway and it's cheaper for me to google for that again than to pretend I'm Indiana Jones with the world's only map to the treasure. If you're talking about the one and only lab notebook, you do have to keep track of that, but just random junk from the internet, eh, if you can't find a good space to file it in a minute and decide what to toss out to make the new thing fit, just toss it.

    • (Score: 2) by VLM on Monday October 13 2014, @09:33PM

      by VLM (445) on Monday October 13 2014, @09:33PM (#105727)

      PS this applies to financial junk and exercise junk too. With all due respect to quicken and mint and fitbit, I found that maintaining an archive and system to store all that junk was a bigger PITA than tossing it all.

      I got my electric bill last month, went online and paid it, got a payment confirmation, then deleted it all. I just don't give a F that I can't compare this September's electric bill with Sept 2013's electric bill, despite attempts by quicken and mint to convince me.

      Ditto the fitbit thing. It's raining cats and dogs today so no going out walking, and I'm just under 2 miles today with plenty of time to go, but I really don't give a F, and next time it goes thru the clothes washer, I'm all done with technology based "health tracking". The iwatch is too little too late; it would have been cool at the peak of the fad about two years ago, but when it arrives it's going to be like getting a new car with a really boss 8-track player.

      • (Score: 0) by Anonymous Coward on Monday October 13 2014, @10:10PM

        by Anonymous Coward on Monday October 13 2014, @10:10PM (#105739)

        I keep utility statements for 12 months - not much space required, and I can prove I paid what they billed me.

        • (Score: 2) by hemocyanin on Tuesday October 14 2014, @12:21AM

          by hemocyanin (186) on Tuesday October 14 2014, @12:21AM (#105778) Journal

          Your bank can print up a statement or cancelled check and if paid by credit card, you can get the payment records there as well.

        • (Score: 2) by urza9814 on Wednesday October 15 2014, @02:32PM

          by urza9814 (3954) on Wednesday October 15 2014, @02:32PM (#106260) Journal

          I keep utility statements for 12 months - not much space required, and I can prove I paid what they billed me.

          Meh, when my landlord used to lose my payments every couple months, I started making sure they gave me receipts...but I still never actually used those receipts. It was easier to just go on my bank's website and print out a copy of the processed check than it was to figure out where the hell I put that receipt...

          Back on the original topic...I used to store EVERYTHING. Every email, every AIM chat log, all of it. The email I still have, simply because it's all been in Gmail since I was 15. The IM logs? Most of them have been lost through various failed backups...but I've realized I couldn't care less. There are even a couple I still have, because I was very careful at the time to back them up separately...and now I go back and look at those conversations and I can't figure out why the hell I thought they were so important. And I don't mean that I don't know why I thought that specific thing would be worth saving -- I don't even know what it was I was trying to save!

          Interestingly, the few things I do occasionally miss are the things that are hardest to back up. Couple conversations from MySpace I wouldn't mind having an archive of, but I don't have a clue how you'd get large amounts of data out of there. Meanwhile the IM logs, which every client stored automatically, are all completely worthless and I have no problem nuking them now. Proprietary networks are gonna be a problem here.

  • (Score: 0) by Anonymous Coward on Monday October 13 2014, @10:57PM

    by Anonymous Coward on Monday October 13 2014, @10:57PM (#105748)

    If you use Linux you shouldn't be zipping files. Does anyone still use zip? bz2 that tarball.

    • (Score: 1) by pnkwarhall on Tuesday October 14 2014, @01:13AM

      by pnkwarhall (4558) on Tuesday October 14 2014, @01:13AM (#105783)

      .bz2'd doesn't have the same ring to it... neither does 'tar'd', for multiple reasons :)

      --
      Lift Yr Skinny Fists Like Antennas to Heaven