posted by janrinok on Friday July 11 2014, @01:07AM   Printer-friendly
from the picking-brains-time dept.

This is probably one of those topics that gets regurgitated periodically, but it's always good to get some fresh answers.

The small consultancy business I work for wants to set up a new file server with remote backup. In the past we have used a Windows XP file server, plugging in a couple of external USB drives when space ran out. Backups were performed nightly to a USB drive and taken offsite to a trusted employee's home.

They are looking at Linux for the new file server (I think mostly because they found out how much a new Windows file server would cost).

I'm not a server guy, though I have set up a simple Debian-based web server at work for a specific intranet application. When I was asked for ideas for the new system, the best I could come up with was ssh+rsync (which I have only recently started using myself, so I'm no expert by any means). Using Amazon's cloud service has been suggested, as has making the remote end a dedicated machine at a trusted employee's home (probably with a new dedicated line in) or at our local ISP (if they can offer such a service). A new dedicated line out of the office has also been suggested, mainly because daily file changes can potentially be quite large (3D CAD models etc.). A possible advantage of the remote end being nearby is that the initial backup could be done with a portable hard drive instead of having to upload terabytes of data (though I guess there are always courier services).
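For reference, the ssh+rsync idea in its simplest form is a one-line nightly cron job. A local demo of the invocation is below; in production the destination becomes an ssh target such as `backup@offsite.example.com:/backups/files/` with `-e "ssh -i /path/to/key"` for unattended runs. All names and paths here are placeholders.

```shell
#!/bin/sh
# Demo of the rsync flags against local directories; swap DEST for an
# ssh target in production. Skips cleanly if rsync is not installed.
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed"; exit 0; }

tmp=$(mktemp -d)
SRC="$tmp/files/"
DEST="$tmp/mirror/"
mkdir -p "$SRC" "$DEST"
echo "cad model" > "${SRC}part1.txt"

# -a preserves ownership/perms/times; --delete keeps the mirror exact.
rsync -a --delete "$SRC" "$DEST"
echo "mirrored: $(ls "$DEST")"
```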

Anyway, just thought I'd chuck it out there. A lot of you guys have probably already set up and/or look after remote backup systems. Even just some ideas regarding potential traps/pitfalls would be handy. The company is fairly small (20-odd employees), so I don't think they need anything overly elaborate, but all feedback is appreciated.

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 3, Informative) by AudioGuy on Friday July 11 2014, @03:19AM

    by AudioGuy (24) on Friday July 11 2014, @03:19AM (#67415) Journal

    I work with small businesses in the 5 to 100 employee range.

    This depends on the nature of your data and your specific business, of course, but in general I have found that unless you live in a modern, connected place like Hong Kong, Tokyo, Latvia, Finland, etc., remote backup is just about unworkable for real, modern office data needs. In a backwater like the US, with its slow, expensive internet: hopeless.

    Yes, there are incremental backups, etc. But consider what happens in real life: the graphics guy decides to reorganize his folders, from 'MyGraphics' (200GB) to 'AllGraphics/Gifs' and 'AllGraphics/Jpegs'. To an incremental backup program he just deleted a folder and created new ones, and all those files need to be copied again. Maybe he also decided to process them all in Photoshop to strip some metadata at the same time, so even the world's smartest incremental backup program is going to see them all as brand-new files to be copied.

    So 200GB of data is going to be copied to your remote 'cloud' location. Just how fast a connection can your small company afford? Let's say you have a 20 Mb/s connection like the local school here. 200 GBytes is roughly 2,000 Gbits (allowing ~10 bits per byte for overhead): 2,000,000,000,000 / 20,000,000 = 100,000 seconds, about 27 hours to transfer what took this guy a few minutes to do. Multiply by 20 people. Your offsite backup will probably never finish.
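    The shell can double-check that arithmetic (using the same rough 10-bits-per-byte allowance for overhead):

```shell
# Transfer time for 200 GB over a 20 Mb/s link, allowing ~10 bits
# per byte for protocol overhead: bits / (bits per second) = seconds.
bits=$(( 200 * 1000000000 * 10 ))
hours=$(( bits / 20000000 / 3600 ))
echo "200 GB over 20 Mb/s: about $hours hours"   # about 27 hours
```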

    Compression? These graphics are mostly already compressed.

    It is of course very dependent on the amount of data, but I have always been surprised at the amount, even from businesses where I would not have expected large data sets.

    Here is what I usually do, seems to work:

    I have one main repository - the file server. This has two large disks mirrored with RAID 1, so the loss of any one drive does not lose data. It runs Samba, which works with both Windows and Macs (and Linux :-) ).
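    For reference, a Samba share for such a box needs only a few lines of smb.conf; the path and group name below are placeholders:

```ini
; /etc/samba/smb.conf fragment - a minimal share for the RAID-1 store.
; Path and group name are placeholders for your own setup.
[files]
   path = /srv/files
   valid users = @staff
   read only = no
```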

    There is a second machine - and note these do not need to be particularly powerful; I often use old repurposed user machines - which I typically call 'archive'. It has two big disks the same size as the main file server's, but these are NOT RAIDed.

    All the backups are done with simple shell scripts running rsync. There are several scripts, designed to handle different cases - some data rarely changes and is not critical; other data may change a lot, and may be so critical that its loss could severely hurt the business.

    One script runs weekly, on weekends. It rsyncs the main file store to either disk 'A' or disk 'B' on the archive machine, alternating weekly. This ensures a serious data error on the main drive can probably be fixed from the previous week's untouched backup (protects against a lot of 'oops, I didn't mean to delete that, and now the backup has run...'). This script gets everything; that is why it runs on weekends, so as not to tie up the internal net during workdays.
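    A sketch of that weekly A/B alternation (mount points are hypothetical, and the actual rsync line is shown commented so the sketch runs safely anywhere):

```shell
#!/bin/sh
# Weekly full sync, alternating between two non-RAIDed archive disks.
# /archive/a, /archive/b and 'fileserver' are placeholder names.

pick_dest() {
    # Alternate A/B by the number of whole weeks since the epoch.
    week=$(( $(date +%s) / 604800 ))
    if [ $(( week % 2 )) -eq 0 ]; then
        echo /archive/a
    else
        echo /archive/b
    fi
}

dest=$(pick_dest)
echo "this week's target: $dest"
# The actual copy:
# rsync -a --delete fileserver:/srv/files/ "$dest/files/"
```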

    Another script runs daily. This is for fast-changing, more critical stuff (it could be hourly in some cases). Usually there is MUCH less of this, and you know where it lives, so these can pretty safely run overnight. It just updates the main archive backup disk, whichever one is selected that week.

    And this script may at the end call one more, which handles very special, very critical data. This data is backed up on a separate section (possibly a separate partition), with folders called 'Today', 'Yesterday', 'ThisWeek', 'LastWeek', 'ThisMonth', 'LastMonth', and sometimes 'LastYear', etc. Data is rotated through these so that it is easy to pull out, say, last month's data in an emergency. You may think this is overkill, but I can say that I have needed to pull data from previous years a number of times. ('Remember that proposal so-and-so sent me last year? Would you still have that backed...')
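    The daily part of that rotation amounts to a few mv's; a minimal sketch (the base path defaults to a demo directory here, and in practice the script would rsync fresh data into 'Today' after rotating):

```shell
#!/bin/sh
# Rotate the critical-data snapshots described above: Yesterday is
# discarded, Today becomes Yesterday, and a fresh Today is created.
# BASE defaults to a demo path; in practice it is the archive partition.
BASE=${BASE:-/tmp/critical-demo}
mkdir -p "$BASE/Today"

rotate_daily() {
    rm -rf "$BASE/Yesterday"
    mv "$BASE/Today" "$BASE/Yesterday"
    mkdir "$BASE/Today"
    # ...then rsync today's critical data into "$BASE/Today"
}

rotate_daily
echo "rotated: previous Today is now $BASE/Yesterday"
```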

    Nothing is compressed; it is a pain and just slows access, prevents easy searching, makes recovery harder, and may fail in various ways. Disk space is cheap.

    Usually the 'archive' machine is also used for a mail backup and archive.

    I usually partition the main data store into two parts: one is what most users see as the fileserver; the other is normally not visible to them and contains system backups, such as all the critical stuff from the servers (/etc dirs if nothing else, but in most cases full backups, so a server can be recreated at any time by a simple disk copy).

    All the rsyncs are 'pulls' to the archive machine, and send emails on completion. If I (or whoever) don't get an email on Monday, I know something is seriously wrong.

    The scripts are really just lists of rsync commands, no fancy programming needed.
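    In that spirit, one such 'list of rsync commands' might look like this (hosts, paths, and the mail address are placeholders; the rsync and mail lines are shown commented so the sketch is safe to run):

```shell
#!/bin/sh
# Daily pull run from the archive box; mails a report when done, so
# silence on Monday means something is seriously wrong.
LOG=/tmp/backup-$(date +%F).log
{
    echo "backup started: $(date)"
    # rsync -a --delete fileserver:/srv/files/  /archive/a/files/
    # rsync -a --delete webserver:/etc/         /archive/a/system/web-etc/
    echo "backup finished: $(date)"
} > "$LOG" 2>&1
# mail -s "backup report $(date +%F)" admin@example.com < "$LOG"
echo "report written to $LOG"
```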

    Off-site backups? Easily done by swapping out the 'out' drive for the week and letting someone take it home. Plug-in FireWire drives work well for this too. The problem in small businesses is getting people to reliably DO it. So here is how it REALLY works in real life:

    Any time a drive fails, or anyone panics thinking they lost something, be sure to ask if that task has been done. That is the most effective time. :-)

    Second component of this: buy drives that are not necessarily the biggest available. When you have to replace them - because they WILL fill up - just keep the old drives and send them home with someone.

    This is what I have come down to after quite a few years, and different approaches. It is very simple and has not failed me.

    -AG

  • (Score: 0) by Anonymous Coward on Friday July 11 2014, @03:31AM

    by Anonymous Coward on Friday July 11 2014, @03:31AM (#67416)

    > Yes, there are incremental backups, etc. But consider what happens in
    > real life: The graphics guy decides to reorganize his folders,
    > 'MyGraphics (200GB)' to 'AllGraphics/Gifs AllGraphics/Jpegs'. To an
    > incremental backup program he just deleted a folder, created new ones,
    > and all those files need to be copied again. Maybe he decided to
    > process them all to remove some metadata in Photoshop at the same
    > time, so even the worlds smartest incremental backup program is going
    > to see those all as brand new files to be copied.

    You are behind the times, my friend. Modern backup systems use data de-duplication algorithms to deal with cases like that. Rename the files, move them to different filesystems, tweak some of the headers in the file - it doesn't matter. Only the disk blocks with changed data get copied.
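    The idea can be illustrated at the file level with content hashes - a moved or renamed file hashes the same, so nothing new is stored. This is a toy sketch, not how Obnam actually works internally; real tools also de-duplicate chunks within files:

```shell
#!/bin/sh
# Toy content-addressed store: files are kept under their SHA-256 hash,
# so a rename or move adds no new data. Illustration only.
STORE=${STORE:-/tmp/dedup-demo}
mkdir -p "$STORE"

store_file() {
    h=$(sha256sum "$1" | cut -d' ' -f1)
    if [ -e "$STORE/$h" ]; then
        echo "dedup hit: $1"
    else
        cp "$1" "$STORE/$h"
        echo "stored: $1"
    fi
}
```

    Storing the same content twice under different names reports a dedup hit the second time around.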

    Here's one program that works like that: Obnam [obnam.org]

    • (Score: 2) by AudioGuy on Friday July 11 2014, @03:42AM

      by AudioGuy (24) on Friday July 11 2014, @03:42AM (#67418) Journal

      But what about the case I mentioned, where the graphics guy did some processing on each file? They are now all different, and Photoshop pretty severely messes with everything.

      And what about the very first backup? 20 machines, full backup. The smallest drive I can FIND anymore is maybe 300GB, and users really DO fill these up.

      'Only the disk blocks with changed data get copied.' He rewrote every file. The disk blocks are all in different locations, very likely.

      • (Score: 1) by CyprusBlue on Friday July 11 2014, @04:21AM

        by CyprusBlue (943) on Friday July 11 2014, @04:21AM (#67431)

        You're arguing without looking up the information. Many different systems these days do indeed handle this gracefully, migrating only the changed blocks at the filesystem level.

        Check out ZFS, for instance; it's what I use for the most part to solve this. And you're also wrong about compression: it's almost a zero CPU hit when done right, for significant savings at times. Combining dedup with compression for the actual stream updates is really not much of a hit, and can save a lot of window time. The bigger issue is generally how you handle full restore windows, as those require much higher bandwidth than deltas.
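        A sketch of the ZFS approach described here - snapshot, then send only the blocks changed since the previous snapshot. The pool/dataset names, snapshot names, and backup host are all hypothetical, and the script bails out politely where no 'tank' pool exists:

```shell
#!/bin/sh
# ZFS incremental replication sketch. Assumes an existing pool named
# 'tank' and a reachable 'backuphost'; both are placeholders.
command -v zfs >/dev/null 2>&1 && zpool list tank >/dev/null 2>&1 || {
    echo "no ZFS pool 'tank' here; commands shown for illustration"
    exit 0
}

# Cheap on-disk compression (lz4 is nearly free on modern CPUs):
zfs set compression=lz4 tank/files

# Snapshot today, then send only the delta since yesterday's snapshot:
zfs snapshot tank/files@$(date +%F)
zfs send -i tank/files@yesterday tank/files@$(date +%F) | \
    ssh backuphost zfs receive backup/files
```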

        Obviously there are outliers, but those situations (like a production studio for instance) clearly have to be handled differently anyway, and are almost straw men when talking about the general small office case.

        • (Score: 2) by AudioGuy on Friday July 11 2014, @05:11AM

          by AudioGuy (24) on Friday July 11 2014, @05:11AM (#67450) Journal

          I did look it up (it has some problems with SQL data, etc.), and I was not unaware of the existence of de-duplication algorithms.

          I picked a poor example of how incrementals can be fooled. The real point was simply that large amounts of data can change in ways you would not expect. I think the color change mentioned below would have been a better choice.

          It is possible my experience is slightly skewed by many of the businesses I deal with being involved in the arts.

          However, the original poster did specifically mention 'because daily file changes can potentially be quite large (3D CAD models etc)'. To me that means 'many gigabytes of data every day' - NEW data. What would be more useful is if he were to mention what the typical amounts actually were.

          If the 3D CAD he is talking about is the kind used for, say, video/movie production, just a simple, slight color change will rewrite the whole file - pretty much every byte of the RGB data - and that file could easily be 20-100 GB. He hasn't said, so I don't know.

          But even other companies surprise me - they have huge print files, they are generating simple video, editing it, color correcting, etc.

          It adds up, and while deduplication sure looks like a useful tool I have my doubts it is -enough- to compensate for the woefully inadequate internet speeds many of us have to deal with. Maybe in Finland it is enough. :-)

          I don't understand the comment about compression (sorry, replying to two different posters at once; probably I shouldn't). I only mentioned that I do not compress the files on disk in the local copy - that just makes it simpler and faster for others to find files in the archive. If I were transferring general files over the net I would certainly want it, though it doesn't help much with already-compressed files like JPEGs and most video. I said nothing about stressing the processor.

          Most of the small businesses I work with have several terabytes of data to back up, so that initial backup would take quite some time. You can dismiss that, but I can't. :-)

      • (Score: 0) by Anonymous Coward on Friday July 11 2014, @04:27AM

        by Anonymous Coward on Friday July 11 2014, @04:27AM (#67432)

        > And what about the very first backup. 20 machines, full backup.

        I wasn't addressing the issue of level-zeros, I was simply pointing out that your claims about how incremental backups work are obsolete.

        > 'Only the disk blocks with changed data get copied.'
        > He rewrote every file. The disk blocks are all in different locations, very likely.

        If you are trying to say that the data offsets within each disk block changed because the file structures aren't block-aligned - well, sure, that's always a risk. There will always be pathological cases. But designing a system around the rare pathological case brings its own risks; you identified one yourself when you pointed out how hard it is to get regular people to haul a disk offsite.

        Like everything in life, it's a series of trade-offs. But you can't make an accurate assessment of the trade-offs if you aren't starting with a realistic evaluation of the available options.

    • (Score: 0) by Anonymous Coward on Friday July 11 2014, @11:41PM

      by Anonymous Coward on Friday July 11 2014, @11:41PM (#67896)

      unless you live in a modern, connected place like [...] Latvia [...]

      ...

      ....

      Bwahahahahaha, LOL, LOL, ROFL, hahahahaha, YEAH, LMAO, mwahahahahahahahahahaha. You made my day, thanks!

  • (Score: 2) by egcagrac0 on Friday July 11 2014, @03:22PM

    by egcagrac0 (2705) on Friday July 11 2014, @03:22PM (#67625)

    I often use old repurposed user machines

    One script runs weekly, on weekends

    Another script runs daily

    one more, which handles very special data, very critical data. This data is backed up on a separate section (possibly separate partition), and it has folders called 'Today' Yesterday' 'ThisWeek' 'LastWeek' 'ThisMonth' 'LastMonth' and

    The scripts are really just lists of rsync commands, no fancy programming needed.

    It is very simple and has not failed me.

    I'm cringing at a lot of that. I'm not exactly sure what you're doing, but it doesn't sound like a backup to me. It sounds like a copy. The two are not the same.

    A copy might get you "something useful" when SHTF. It's a damn sight better than nothing.

    A backup gets RPO, RTO, and some version history. A backup gets tested.

    Trying to get there with someone's old desktop and three hard drives... hopefully your customers understand the inherent risks in advance, and their business continuity plan aligns acceptably with what you're doing.

    A small business of about that size that I was formerly affiliated with had a consultant who thought up a similar scheme. After a junior executive (with the same last name as the owner) "reorganized"* the file server to "clean up" what were apparently needed documents, we discovered that even though we were paying a third party for a backup, those backups were unrecoverable.

    *When we reminded him two weeks later, when he did it again, that we had said "backups aren't working, don't go cleaning up any more files until we say otherwise", his reply was that he had to be able to delete needed files without consequences... backups were a nightmare at that place, since they didn't want to spend any money on what was apparently business critical stuff.