The Non-Beginner’s Guide to Syncing Data with Rsync
The rsync protocol can be pretty simple to use for ordinary backup/synchronization jobs, but some of its more advanced features may surprise you. In this article, we’re going to show how even the biggest data hoarders and backup enthusiasts can wield rsync as a single solution for all of their data redundancy needs.
Warning: Advanced Geeks Only
If you’re sitting there thinking “What the heck is rsync?” or “I only use rsync for really simple tasks,” you may want to check out our previous article on how to use rsync to backup your data on Linux, which gives an introduction to rsync, guides you through installation, and showcases its more basic functions. Once you have a firm grasp of how to use rsync (honestly, it isn’t that complex) and are comfortable with a Linux terminal, you’re ready to move on to this advanced guide.
Running rsync on Windows
First, let’s get our Windows readers on the same page as our Linux gurus. Although rsync is built to run on Unix-like systems, there’s no reason you shouldn’t be able to use it just as easily on Windows. Cygwin provides a wonderful Linux-like environment that we can use to run rsync, so head over to the Cygwin website and download the 32-bit or 64-bit installer, depending on your computer.
Installation is straightforward; you can keep all options at their default values until you get to the “Select Packages” screen. Here, use the search box to find the rsync package and mark it for installation. You then need to do the same for Vim and SSH, but those packages are going to look a bit different when you go to select them, so here are some screenshots:
Installing Vim:
Installing SSH:
After you’ve selected those three packages, keep clicking next until you finish the installation. Then you can open Cygwin by clicking on the icon that the installer placed on your desktop.
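If you want to make sure everything made it in, you can ask each tool for its version right from the Cygwin terminal (this is purely a sanity check, not a required step):

rsync --version
ssh -V
vim --version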
rsync Commands: Simple to Advanced
Now that the Windows users are on the same page, let’s take a look at a simple rsync command, and show how the use of some advanced switches can quickly make it complex.
Let’s say you have a bunch of files that need to be backed up – who doesn’t these days? You plug in your portable hard drive so you can back up your computer’s files, and issue the following command:
rsync -a /home/geek/files/ /mnt/usb/files/
Or, the way it would look on a Windows computer with Cygwin:
rsync -a /cygdrive/c/files/ /cygdrive/e/files/
Pretty simple, and at that point there’s really no need to use rsync, since you could just drag and drop the files. However, if your other hard drive already has some of the files and just needs the updated versions plus the files that have been created since the last sync, this command is handy because it only sends the new data over to the hard drive. With big files, and especially when transferring files over the internet, that is a big deal.
Backing up your files to an external hard drive and then keeping the hard drive in the same location as your computer is a very bad idea, so let’s take a look at what it would require to start sending your files over the internet to another computer (one you’ve rented, a family member’s, etc).
rsync -av --delete -e 'ssh -p 12345' /home/geek/files/ geek2@10.1.1.1:/home/geek2/files/
The above command would send your files to another computer with an IP address of 10.1.1.1. It would delete extraneous files from the destination that no longer exist in the source directory, output the filenames being transferred so you have an idea of what’s going on, and tunnel rsync through SSH on port 12345.
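One tip if you plan to automate this later: rsync will stop and ask for the SSH password on every run unless key-based authentication is set up first. A minimal sketch, assuming OpenSSH on both machines and the same example port and host as above:

ssh-keygen -t ed25519
ssh-copy-id -p 12345 geek2@10.1.1.1

Once the key is in place, rsync (and the scp command we’ll use later) can log in without prompting you.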
The -a, -v, -e, and --delete switches are some of the most basic and commonly used; you should already know a good deal about them if you’re reading this tutorial. Let’s go over some other switches that are sometimes ignored but incredibly useful:
--progress – This switch allows us to see the transfer progress of each file. It’s particularly useful when transferring large files over the internet, but can output a senseless amount of information when just transferring small files across a fast network.
An rsync command with the --progress switch, shown while a backup is in progress:
--partial – This is another switch that is particularly useful when transferring large files over the internet. If rsync gets interrupted for any reason in the middle of a file transfer, the partially transferred file is kept in the destination directory, and the transfer resumes where it left off the next time the rsync command is executed. When transferring large files over the internet (say, a couple of gigabytes), there’s nothing worse than a few-second internet outage, blue screen, or human error tripping up your file transfer and forcing you to start all over again.
-P – This switch combines --progress and --partial, so use it instead and it will make your rsync command a little neater.
-z or --compress – This switch will make rsync compress file data as it’s being transferred, reducing the amount of data that has to be sent to the destination. It’s actually a fairly common switch but is far from essential, only really benefiting you on transfers over slow connections, and it does nothing for the following types of files: 7z, avi, bz2, deb, gz, iso, jpeg, jpg, mov, mp3, mp4, ogg, rpm, tbz, tgz, z, zip.
-h or --human-readable – If you’re using the --progress switch, you’ll definitely want to use this one as well. That is, unless you like converting bytes to megabytes on the fly. The -h switch converts all of the output numbers to a human-readable format, so you can actually make sense of the amount of data being transferred.
-n or --dry-run – This switch is essential to know when you’re first writing your rsync script and testing it out. It performs a trial run but doesn’t actually make any changes – the would-be changes are still output as normal, so you can read over everything and make sure it looks okay before rolling your script into production.
-R or --relative – This switch must be used if the destination directory doesn’t already exist. We will use this option later in this guide so that we can create directories on the target machine with timestamps in their names.
--exclude-from – This switch is used to point to an exclude list: a plain text file with one directory or file path per line that you don’t want backed up (there’s a sample exclude file right after this list).
--include-from – Similar to --exclude-from, but it points to a file that contains the directories and file paths of data you do want backed up.
--stats – Not really an important switch by any means, but if you’re a sysadmin, it can be handy to know the detailed stats of each backup, just so you can monitor the amount of traffic being sent over your network and such.
--log-file – This lets you send the rsync output to a log file. We definitely recommend this for automated backups in which you aren’t there to read through the output yourself. Always give your log files a once-over in your spare time to make sure everything is working properly. It’s also a crucial switch for a sysadmin to use, so you’re not left wondering how your backups failed while you left the intern in charge.
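As promised above, here’s what a bare-bones exclude file might look like. The entries are just examples – list whatever you don’t want backed up, one path or pattern per line:

# blank lines and lines starting with # are ignored
cache/
Downloads/
*.tmp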
Let’s take a look at our rsync command now that we have a few more switches added:
rsync -avzhP --delete --stats --log-file=/home/geek/rsynclogs/backup.log --exclude-from '/home/geek/exclude.txt' -e 'ssh -p 12345' /home/geek/files/ geek2@10.1.1.1:/home/geek2/files/
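Before trusting a command like that with real data, remember the -n switch from the list above. Adding it turns the whole thing into a harmless trial run, using the same hypothetical paths, port, and exclude file:

rsync -avzhPn --delete --stats --exclude-from '/home/geek/exclude.txt' -e 'ssh -p 12345' /home/geek/files/ geek2@10.1.1.1:/home/geek2/files/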
The command is still pretty simple, but we haven’t created a decent backup solution yet. Even though our files are now in two different physical locations, this backup does nothing to protect us from one of the main causes of data loss: human error.
Snapshot Backups
If you accidentally delete a file, a virus corrupts some of your files, or something else happens that undesirably alters your data, and you then run your rsync backup script, your backed-up data is overwritten with those undesirable changes. When such a thing occurs (not if, but when), your backup solution has done nothing to protect you from data loss.
The creator of rsync realized this and added the --backup and --backup-dir arguments so users could run differential backups. The very first example on rsync’s website shows a script in which a full backup is run every seven days, and then the changes to those files are backed up into separate directories each day. The problem with this method is that to recover your files, you effectively have to recover them seven different times. Moreover, most geeks run their backups several times a day, so you could easily have 20+ different backup directories at any given time. Not only is recovering your files now a pain, but even just looking through your backed-up data can be extremely time consuming – you’d have to know the last time a file was changed in order to find its most recent backed-up copy. On top of all that, it’s inefficient to have to run a full backup every week (or even less often, in some setups).
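For reference, a differential run in that style would look something like this – just a sketch with made-up paths and dates, since it’s exactly the approach we’re about to abandon:

rsync -av --delete --backup --backup-dir=/home/geek2/increments/2016-10-17/ /home/geek/files/ geek2@10.1.1.1:/home/geek2/files/

The previous version of anything that changed or was deleted gets shuffled off into that day’s increments directory on the receiving end, which is exactly why restores turn into a scavenger hunt.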
Snapshot backups to the rescue! Snapshot backups are nothing more than incremental backups, but they utilize hardlinks to retain the file structure of the original source. That may be hard to wrap your head around at first, so let’s take a look at an example.
Pretend we have a backup script running that automatically backs up our data every two hours. Whenever rsync runs, it names each backup directory after the date and time of that run (the same year-month-day-hour format our script below generates). So, at the end of a typical day, our destination directory would contain one dated folder for each backup.
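Assuming the two-hour schedule and that timestamp format (the dates themselves are just an example), the listing would look something like this:

2016-10-17-12AM
2016-10-17-02AM
2016-10-17-04AM
2016-10-17-06AM
2016-10-17-08AM
2016-10-17-10AM
2016-10-17-12PM
2016-10-17-02PM
2016-10-17-04PM
2016-10-17-06PM
2016-10-17-08PM
2016-10-17-10PM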
When traversing any of those directories, you’d see every file from the source directory exactly as it was at that time, yet there would be no duplicates across any two directories. rsync accomplishes this with the use of hard linking, through the --link-dest=DIR argument.
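Stripped down to its essentials, a --link-dest run looks like this – a local sketch with made-up paths, while our real script below does the same thing over SSH:

rsync -a --delete --link-dest=/backups/2016-10-17-02PM/ /home/geek/files/ /backups/2016-10-17-04PM/

Files that haven’t changed since the 2PM backup are hard-linked into the 4PM directory instead of being copied again, so every snapshot looks like a complete copy while only consuming space for what actually changed.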
Of course, in order to have these nicely- and neatly-dated directory names, we’re going to have to beef up our rsync script a bit. Let’s take a look at what it would take to accomplish a backup solution like this, and then we’ll explain the script in greater detail:
#!/bin/bash
#copy old time.txt to time2.txt
yes | cp ~/backup/time.txt ~/backup/time2.txt
#overwrite old time.txt file with new time
echo `date +"%F-%I%p"` > ~/backup/time.txt
#make the log file
echo "" > ~/backup/rsync-`date +"%F-%I%p"`.log
#rsync command
rsync -avzhPR --chmod=Du=rwx,Dgo=rx,Fu=rw,Fgo=r --delete --stats --log-file ~/backup/rsync-`date +"%F-%I%p"`.log --exclude-from ~/exclude.txt --link-dest=/home/geek2/files/`cat ~/backup/time2.txt` -e 'ssh -p 12345' /home/geek/files/ geek2@10.1.1.1:/home/geek2/files/`date +"%F-%I%p"`/
#don't forget to scp the log file and put it with the backup
scp -P 12345 ~/backup/rsync-`cat ~/backup/time.txt`.log geek2@10.1.1.1:/home/geek2/files/`cat ~/backup/time.txt`/rsync-`cat ~/backup/time.txt`.log
That would be a typical snapshot rsync script. In case we lost you somewhere, let’s dissect it piece by piece:
The first command in our script copies the contents of time.txt to time2.txt. The yes pipe automatically confirms that we want to overwrite the file, in case cp asks. Next, we take the current time and put it into time.txt. These files will come in handy later.
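If you’re curious, there’s nothing magical inside those files – each one holds a single timestamp. For example (with made-up dates), right after a 4:00 PM run:

cat ~/backup/time.txt     # prints 2016-10-17-04PM (this run)
cat ~/backup/time2.txt    # prints 2016-10-17-02PM (the run before it)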
The next line makes the rsync log file, naming it rsync-date.log (where date is the actual date and time).
Now, the complex rsync command that we’ve been warning you about:
-avzhPR, -e, --delete, --stats, --log-file, --exclude-from, --link-dest – Just the switches we talked about earlier; scroll up if you need a refresher.
--chmod=Du=rwx,Dgo=rx,Fu=rw,Fgo=r – These are the permissions for the destination directory. Since we are creating this directory in the middle of our rsync script, we need to specify the permissions so that our user can write files to it.
The use of date and cat commands
We’re going to go over each use of the date and cat commands inside the rsync command, in the order that they occur. Note: we’re aware that there are other ways to accomplish this, particularly by declaring variables, but for the purpose of this guide we’ve decided to use this method.
The log file is specified as:
~/backup/rsync-`date +"%F-%I%p"`.log
Alternatively, we could have specified it as:
~/backup/rsync-`cat ~/backup/time.txt`.log
Either way, the --log-file option should be able to find the previously created, dated log file and write to it.
The link destination directory is specified as:
--link-dest=/home/geek2/files/`cat ~/backup/time2.txt`
This means that the --link-dest option is given the directory of the previous backup. If we are running backups every two hours, and it’s 4:00 PM when this script runs, then --link-dest points to the directory created at 2:00 PM, and rsync only transfers the data that has changed since then (if any) – everything that hasn’t changed is hard-linked to the files in that earlier directory.
To reiterate, that is why time.txt is copied to time2.txt at the beginning of the script, so the --link-dest option can reference that time later.
The destination directory is specified as:
geek2@10.1.1.1:/home/geek2/files/`date +"%F-%I%p"`
This simply puts the source files into a directory named with the current date and time.
Finally, we make sure that a copy of the log file is placed inside the backup.
scp -P 12345 ~/backup/rsync-`cat ~/backup/time.txt`.log geek2@10.1.1.1:/home/geek2/files/`cat ~/backup/time.txt`/rsync-`cat ~/backup/time.txt`.log
We use secure copy on port 12345 to take the rsync log and place it in the proper directory. To select the correct log file and make sure it ends up in the right spot, the time.txt file must be referenced via the cat command. If you’re wondering why we decided to cat time.txt instead of just using the date command, it’s because a lot of time could have transpired while the rsync command was running, so to make sure we have the right time, we just cat the text document we created earlier.
Automation
Use cron on Linux or Task Scheduler on Windows to automate your rsync script. One thing you have to be careful of is making sure you end any currently running rsync processes before starting a new one. Task Scheduler seems to close any already running instances automatically, but for Linux you’ll need to be a little more creative.
Most Linux distributions can use the pkill command, so just be sure to add the following to the beginning of your rsync script:
pkill -9 rsync
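From there, scheduling is a single crontab entry away. A sketch, assuming you saved the script as /home/geek/backup/rsync-snapshot.sh and made it executable (the path and schedule are yours to change), added via crontab -e:

# run the snapshot backup script every two hours, on the hour
0 */2 * * * /home/geek/backup/rsync-snapshot.sh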
Encryption
Nope, we’re not done yet. We finally have a fantastic (and free!) backup solution in place, but all of our files are still susceptible to theft. Hopefully, you’re backing up your files to some place hundreds of miles away. No matter how secure that faraway place is, theft and hacking can always be problems.
In our examples, we have tunneled all of our rsync traffic through SSH, so that means all of our files are encrypted while in transit to their destination. However, we need to make sure the destination is just as secure. Keep in mind that rsync only encrypts your data as it is being transferred, but the files are wide open once they reach their destination.
One of rsync’s best features is that it only transfers the changes in each file. But if your files are individually encrypted and you make one minor change, the entire file has to be retransmitted, because encryption effectively randomizes the whole file’s contents after any change.
For this reason, it’s best/easiest to use some type of disk encryption, such as BitLocker for Windows or dm-crypt for Linux. That way, your data is protected in the event of theft, but files can be transferred with rsync and your encryption won’t hinder its performance. There are other options available that work similarly to rsync or even implement some form of it, such as Duplicity, but they lack some of the features that rsync has to offer.
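On the Linux side, a bare-bones dm-crypt/LUKS setup for a dedicated backup drive might look like the following. This is purely a sketch, run as root, and it assumes the backup partition is /dev/sdb1 – triple-check the device name, because formatting it destroys whatever is on it:

cryptsetup luksFormat /dev/sdb1        # encrypt the partition (one time, wipes existing data)
cryptsetup luksOpen /dev/sdb1 backup   # unlock it as /dev/mapper/backup
mkfs.ext4 /dev/mapper/backup           # create a filesystem on it (one time)
mkdir -p /mnt/backup
mount /dev/mapper/backup /mnt/backup   # mount it, then point rsync's destination here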
After you’ve set up your snapshot backups at an offsite location and encrypted your source and destination hard drives, give yourself a pat on the back for mastering rsync and implementing the most foolproof data backup solution possible.