This is probably one of those topics that gets regurgitated periodically, but it's always good to get some fresh answers.
The small consultancy I work for wants to set up a new file server with remote backup. In the past we have used a Windows XP file server and plugged in a couple of external USB drives when space ran out. Backups were performed nightly to a USB drive and taken offsite to a trusted employee's home.
They are looking at Linux for the new file server (I think mostly because they found out how much a new Windows file server would cost).
I'm not a server guy, but I have set up a simple Debian-based web server at work for a specific intranet application. When I was asked for ideas for the new system, the best I could come up with was maybe ssh+rsync (which I have only recently started using myself, so I'm no expert by any means). Using Amazon's cloud service has been suggested, as has making the remote end a dedicated machine at a trusted employee's home (probably with a new dedicated line in) or at our local ISP (if they can offer such a service). A new dedicated line out of the office has also been suggested, I think mainly because daily file changes can potentially be quite large (3D CAD models etc.). A possible advantage of the remote end being nearby is that the initial backup could be done with a portable hard drive instead of having to upload terabytes of data (though I guess there are always courier services).
Anyway, just thought I'd chuck it out there. A lot of you have probably set up and/or look after remote backup systems, and even just ideas about potential traps and pitfalls would be handy. The company is fairly small (20-odd employees), so I don't think they need anything overly elaborate, but all feedback is appreciated.
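For reference, the ssh+rsync idea can be made incremental fairly cheaply with rsync's --link-dest option, which hard-links unchanged files against the previous night's snapshot so only changed files are transferred and stored. A minimal sketch of building such a command (the host name and paths here are made up):

```python
import datetime
import subprocess

def build_rsync_command(src, dest_host, dest_root, prev_snapshot=None):
    """Build an rsync-over-ssh command for a nightly snapshot.

    --link-dest hard-links unchanged files against the previous
    snapshot, so each run only transfers (and stores) what changed.
    """
    today = datetime.date.today().isoformat()
    cmd = ["rsync", "-az", "--delete",
           "-e", "ssh"]                    # tunnel over SSH
    if prev_snapshot:
        cmd.append(f"--link-dest={dest_root}/{prev_snapshot}")
    cmd += [src, f"{dest_host}:{dest_root}/{today}"]
    return cmd

# Hypothetical host and paths; run with subprocess.run(cmd, check=True)
cmd = build_rsync_command("/srv/files/", "backup@offsite", "/backups",
                          prev_snapshot="2014-07-10")
print(" ".join(cmd))
```

Each dated directory then looks like a full backup but costs only the delta in disk space, which also makes restores trivial to browse.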
(Score: 0) by Anonymous Coward on Friday July 11 2014, @03:31AM
> Yes, there are incremental backups, etc. But consider what happens in
> real life: The graphics guy decides to reorganize his folders,
> 'MyGraphics (200GB)' to 'AllGraphics/Gifs AllGraphics/Jpegs'. To an
> incremental backup program he just deleted a folder, created new ones,
> and all those files need to be copied again. Maybe he decided to
> process them all to remove some metadata in Photoshop at the same
> time, so even the world's smartest incremental backup program is going
> to see those all as brand new files to be copied.
You are behind the times, my friend. Modern backup systems use data de-duplication algorithms to deal with cases like that. Rename the files, move them to a different filesystem, tweak some of the headers in the file: it doesn't matter. Only the disk blocks with changed data get copied.
Here's one program that works like that: Obnam [obnam.org]
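The core idea is simple enough to show in a toy sketch: hash the data in blocks, and only ship blocks the server hasn't seen before. A renamed or moved file produces the same block hashes, so nothing new is uploaded. (Real tools are more sophisticated, e.g. chunking by content rather than fixed offsets, but the principle is the same.)

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size blocks, for simplicity

def block_hashes(data):
    """Split data into fixed-size blocks and hash each one."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

# Stand-in for the backup server's store of already-seen blocks.
store = set()

def backup(data):
    """Return the number of blocks that actually get uploaded."""
    new = [h for h in block_hashes(data) if h not in store]
    store.update(new)
    return len(new)

photo = b"\x89PNG" + b"pixel data " * 100_000  # ~1 MB of fake image data

first = backup(photo)    # first run: nothing is in the store yet
renamed = backup(photo)  # "renamed/moved" file: identical content
print(first, renamed)    # the second transfer is 0 blocks
```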
(Score: 2) by AudioGuy on Friday July 11 2014, @03:42AM
But what about the case I mentioned, where the graphics guy did some processing on each file? They are now all different, and Photoshop pretty severely messes with everything.
And what about the very first backup? 20 machines, full backup. The smallest drive I can FIND anymore is maybe 300GB, and users really DO fill these up.
'Only the disk blocks with changed data get copied.' He rewrote every file. The disk blocks are all in different locations, very likely.
(Score: 1) by CyprusBlue on Friday July 11 2014, @04:21AM
You're arguing without looking up the information. Many systems these days do indeed handle this gracefully, migrating only the changed blocks at the filesystem level.
Check out ZFS, for instance; it's what I mostly use to solve this. You're also wrong about compression: it's almost a zero CPU hit when done right, for significant savings at times. Combining dedup with compression for the actual stream updates really isn't much of a hit, and can save a lot of window time. The bigger issue is generally how you handle full restores, as those require much higher bandwidth than deltas.
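The compression point is easy to illustrate. ZFS typically uses a very fast algorithm (LZ4); as a rough stand-in, here is zlib at its fastest setting on a made-up, repetitive CAD-ish payload. The exact ratio and timing depend on the data and machine, which is why none are hard-coded:

```python
import time
import zlib

# Mostly-textual, repetitive data compresses well;
# already-compressed data (JPEG, most video) would not.
payload = b"model vertex 0.123 0.456 0.789\n" * 200_000  # ~6 MB

t0 = time.perf_counter()
packed = zlib.compress(payload, 1)  # level 1: fastest, lowest CPU
elapsed = time.perf_counter() - t0

ratio = len(packed) / len(payload)
print(f"compressed to {ratio:.1%} of original in {elapsed:.3f}s")
```

On compressible data the transfer shrinks by far more than the CPU time costs, which is why fast in-stream compression is close to free in practice.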
Obviously there are outliers, but those situations (like a production studio for instance) clearly have to be handled differently anyway, and are almost straw men when talking about the general small office case.
(Score: 2) by AudioGuy on Friday July 11 2014, @05:11AM
I did look it up (it has some problems with SQL data, etc.), and I was not unaware that de-duplication algorithms exist.
I picked a poor example of how incrementals can be fooled. The real point was simply that large amounts of data can change in ways you would not expect. I think the color change mentioned below would have been a better choice.
It is possible my experience is slightly skewed by many of the businesses I deal with being involved in the arts.
However, the original poster did specifically mention 'because daily file changes can potentially be quite large (3D CAD models etc)'. To me that means many gigabytes of NEW data every day. It would be more useful if he mentioned what the typical amounts actually are.
If the 3D CAD he is talking about is the kind used for, say, video/movie production, then a simple, slight color change will rewrite the whole file, pretty much every byte of the RGB data, and that file could easily be 20-100 GB. He hasn't said, so I don't know.
But even other companies surprise me - they have huge print files, they are generating simple video, editing it, color correcting, etc.
It adds up, and while de-duplication sure looks like a useful tool, I have my doubts it is -enough- to compensate for the woefully inadequate internet speeds many of us have to deal with. Maybe in Finland it is enough. :-)
I don't understand the comment about compression (sorry, replying to two different posters at once; probably I shouldn't). I only mentioned that I do not compress the files on disk in the local copy, which just makes it simpler and faster for others to find files in the archive. If I were transferring general files over the net I would certainly want compression, though it doesn't help much with already-compressed files like JPEGs and most video. I said nothing about stressing the processor.
Most of the small businesses I work with have several terabytes of data to back up, so that initial backup would take quite some time. You can dismiss that, but I can't. :-)
(Score: 0) by Anonymous Coward on Friday July 11 2014, @04:27AM
> And what about the very first backup. 20 machines, full backup.
I wasn't addressing the issue of level-zero (full) backups; I was simply pointing out that your claims about how incremental backups work are obsolete.
> 'Only the disk blocks with changed data get copied.'
> He rewrote every file. The disk blocks are all in different locations, very likely.
If you are trying to say that the data offsets within each disk block changed because the file structures aren't block-aligned, then sure, that's always a risk. There will always be pathological cases. But designing a system around the rare pathological case brings its own risks - you identified one yourself when you pointed out how hard it is to get regular people to haul a disk offsite.
Like everything in life, it's a series of trade-offs. But you can't make an accurate assessment of the trade-offs if you aren't starting with a realistic evaluation of the available options.
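To make the alignment point concrete: with fixed-size blocks, an append-only edit is cheap, but a single byte inserted at the front shifts every block boundary and defeats dedup entirely (which is exactly why content-defined chunking was invented). A toy demonstration:

```python
import hashlib
import os

BLOCK = 4096

def block_hashes(data):
    """Set of hashes of the fixed-size blocks in data."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

original = os.urandom(BLOCK * 256)     # ~1 MB of random "file" data

appended = original + os.urandom(100)  # append-only edit
shifted = b"X" + original              # one byte inserted at the front

# Appending leaves every existing block intact...
print(len(block_hashes(original) & block_hashes(appended)))  # 256

# ...but the one-byte insert shifts every boundary: no block survives.
print(len(block_hashes(original) & block_hashes(shifted)))   # 0
```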
(Score: 0) by Anonymous Coward on Friday July 11 2014, @11:41PM
...
....
Bwahahahahaha, LOL, LOL, ROFL, hahahahaha, YEAH, LMAO, mwahahahahahahahahahaha. You made my day, thanks!