This LWN article is perhaps a bit confused, but it does point out a truth: on modern hardware, with fast local storage and large, often static files, rsync is often unnecessarily slow.

That's probably true to some extent when using rsync across a LAN, but it's especially true when using it to copy files locally. rsync runs as a client/server pair, and both processes MD5 checksum the files as they are being transferred. That's nice, but it's slow too.
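To make that cost concrete, here is a minimal Python sketch (not rsync's actual code) contrasting a plain copy with one that feeds every byte through MD5 on the way, roughly the extra work described above. The function names and the 1 MiB chunk size are assumptions made for illustration.

```python
import hashlib

CHUNK = 1 << 20  # 1 MiB read size; an arbitrary choice for this sketch


def copy_plain(src, dst):
    """Copy src to dst with no checksumming, roughly what cp does."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while chunk := fin.read(CHUNK):
            fout.write(chunk)


def copy_with_md5(src, dst):
    """Copy src to dst, hashing every byte on the way through.

    Returns the hex MD5 digest of the data that was copied.
    """
    digest = hashlib.md5()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while chunk := fin.read(CHUNK):
            digest.update(chunk)  # the extra CPU work per chunk
            fout.write(chunk)
    return digest.hexdigest()
```

Timing both functions on the same large file, for instance with time.perf_counter(), gives a rough feel for the hashing overhead on a given CPU; the result will vary a lot between a desktop and a small NAS.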

The funny thing about rsync is that probably 50% of its uses don't involve the core feature it was written to provide: updating files by transferring differences. Other than ensuring data validity, that feature is the only reason it needs slow checksums. Instead, rsync is often chosen because of all the other awesome features that were glommed onto it over the past several decades.

Compared with rsync, cp is pretty laughable -- it can't even exclude files from being copied by patterns. And unlike rsync, cp command lines are not often developed by repeated trial and error -- cp does not recover well from being ctrl-c'd in the middle, while rsync does. These kinds of things make a lot of us reach for rsync first, even if the situation does not involve incremental file changes. In most any situation, one of rsync's 120-some options is sure to be just what you need...

So lots of scripts use rsync to synchronise directories, but the amount of speedup obtained by using the rsync algorithm is often low. (Correction: rsync turns off the delta-transfer algorithm by default for local-to-local transfers. It still does MD5 checksums, however.) Since rsync is reasonably fast, we generally don't care that it does these checksums, which probably slow it down on average. But if larger files, like videos, are involved, this starts to change. When rsync is run on a typical home NAS with a slow (ARM) CPU, the picture changes entirely -- now the checksum overhead is unbearable, while the IO overhead is minimal.

So, here's local-rsync. It takes all the same options as rsync, with the caveat that SOURCE and DEST must be the first two options, and must be local directories.

% local-rsync huge/ /media/usb/huge/ -a -v --exclude '*~' --delete --max-delete=100

It speeds up rsync in these types of situations, by querying it to find what files need to be updated, and updating them the brute force way, with cp. At the end, rsync is run, to take care of the non-brute force stuff (like deletions and file permissions).
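The three steps just described (ask rsync what it would transfer, copy those files whole, then let rsync finish up) can be sketched in Python as follows. This is a hypothetical illustration, not the actual local-rsync script: the function names are invented, it assumes SOURCE is given with a trailing slash so that rsync reports paths relative to it, and it assumes file names contain no characters that rsync would escape in its output. The rsync options it relies on (--dry-run and --out-format with %n for the file name) are standard.

```python
import shutil
import subprocess
from pathlib import Path


def select_regular_files(lines, src):
    """Keep only output lines that name a regular file under src.

    This drops directories, "deleting ..." lines and any -v chatter.
    """
    selected = []
    for line in lines:
        rel = line.rstrip("\n")
        if not rel:
            continue
        path = Path(src) / rel
        if path.is_file() and not path.is_symlink():
            selected.append(rel)
    return selected


def files_to_update(src, dst, rsync_args):
    """Dry-run rsync and return the relative paths it would transfer."""
    cmd = ["rsync", "--dry-run", "--out-format=%n", *rsync_args, src, dst]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return select_regular_files(result.stdout.splitlines(), src)


def brute_force_copy(src, dst, relpaths):
    """Copy each file whole; copy2 keeps the mtime so rsync later skips it."""
    for rel in relpaths:
        target = Path(dst) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(Path(src) / rel, target)


def local_rsync_sketch(src, dst, rsync_args):
    """Assumes src ends in "/" so rsync reports paths relative to src."""
    brute_force_copy(src, dst, files_to_update(src, dst, rsync_args))
    # Final pass: deletions, permissions and anything not handled above.
    subprocess.run(["rsync", *rsync_args, src, dst], check=True)
```

A call such as local_rsync_sketch("huge/", "/media/usb/huge/", ["-a", "--delete"]) would mirror the command line above. Because targets are overwritten before rsync itself runs, options that depend on the old target contents will not behave as they do under plain rsync.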

On a fast CPU, local-rsync will probably speed up rsync of large, static files by a factor of 2. On a slow CPU, local-rsync is so fast, and rsync so slow, that I have not bothered to benchmark. :)

This is just a hack (easily replaced by a --no-checksum option in rsync, of course). But I think it illustrates some interesting things about how a program's underlying assumptions about its environment can change over time, and how free software programs can accrete value until the original differentiating reason for their existence is not the most important thing about them anymore.

I'll probably use that option next time I can't stand a 10G file any longer
Comment by Kete late Wednesday night, August 19th, 2010
Unless you pass --checksum, rsync looks at files' size+mtime to determine if they need to be updated, and only then does checksumming to transfer the differences. I am having difficulties seeing how your stdio/shell/1-cp-process-spawned-per-file solution should be any faster.
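The size+mtime test mentioned here can be illustrated with a short Python sketch. It is a simplified stand-in for rsync's quick check, not its real implementation: the function name is invented, and comparing modification times in whole seconds is an assumption of this example.

```python
import os


def quick_check_differs(src, dst):
    """Return True if dst is missing or differs from src in size or mtime."""
    try:
        dst_stat = os.stat(dst)
    except FileNotFoundError:
        return True  # nothing at the destination yet, so it needs a transfer
    src_stat = os.stat(src)
    return (
        src_stat.st_size != dst_stat.st_size
        or int(src_stat.st_mtime) != int(dst_stat.st_mtime)
    )
```

Note that no file contents are read: two files with the same size and timestamp are treated as identical even if their bytes differ, which is the gap that --checksum exists to close.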
Comment by madduck in the wee hours of Wednesday night, August 19th, 2010
Oh, are you saying that it's faster on slow CPUs to brute-force cp the whole thing, rather than to bother finding the differences? If so, then I am sure this could be patched into rsync. OTOH, I find that even on my NAS hardware, IO is the bottleneck, not the CPU. This is especially true for the Thecus N2100…
Comment by madduck in the wee hours of Wednesday night, August 19th, 2010
Does the rsync -W option do the same thing as brute-forcing cp?
Comment by Lars Wirzenius in the wee hours of Wednesday night, August 19th, 2010

Rsync always checksums. Please see the man page, if you don't believe me:

              Note that rsync always verifies that each transferred file was
              correctly reconstructed on the receiving side by checking a
              whole-file checksum that is generated as the file is transferred

That checksum is an MD5 sum. There is also a second, rolling checksum used by the rsync algorithm. Apparently it does both.

And, there is a third one that has to be enabled with --checksum to better detect if an existing file has been changed. But that one is not really relevant.
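For readers unfamiliar with the rolling checksum mentioned above, the following Python sketch shows the idea as described in the rsync technical report: a weak two-part sum that can be slid forward one byte at a time without rereading the whole block. It is an illustration under that assumption, not rsync's source code, and the names weak_checksum and roll are made up for this example.

```python
M = 1 << 16  # both halves of the weak checksum are kept modulo 2^16


def weak_checksum(block):
    """Compute the (a, b) pair for a block of bytes from scratch."""
    n = len(block)
    a = sum(block) % M
    # Earlier bytes get larger weights: n for the first, 1 for the last.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide a window of length n forward by one byte in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b
```

Starting from weak_checksum(data[:n]), each call to roll(a, b, data[k], data[k + n], n) yields the checksum of the window beginning at k + 1. That cheap update is what makes searching for matching blocks feasible, whereas the whole-file MD5 has to digest every byte that is transferred.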

Comment by joey in the wee hours of Wednesday night, August 19th, 2010

Liw: I think -W might be the magic option I was looking for! Hidden among the hundred or so other magic options. :)

Madduck: Actually, I've been doing all my testing on an N2100, although disk writes have been going to a USB disk. Still, rsync with checksumming is much, much slower than just blasting the bits.

Comment by joey in the wee hours of Wednesday night, August 19th, 2010

Regarding -W, the man page says it's the default for local paths. Since rsync is, in my experience, still CPU-bound on local paths, I think -W must not be disabling all the checksums. Probably rsync is still doing the MD5 sum that it uses as a whole-file consistency check. -W may disable the rolling checksum only.

A code dive is in order.

Comment by joey in the wee hours of Wednesday night, August 19th, 2010

There are a number of problems with your script. For example, it breaks --backup: since you're deleting/overwriting the target file with the brute-force cp first, the old version is lost and can no longer be backed up. Emulating this is next to impossible. Also, in the --dry-run line, adding slashes to src arguments alters rsync behaviour (these could easily be taken away). And if the source or target is actually a file (as opposed to a directory), the script breaks, while rsync alone does not.

Nice idea, but better not to rely on this script; take a code dive instead :-)

Comment by jensstimpfle [myopenid.com] late Wednesday evening, August 25th, 2010

@madduck, is there a way to make rsync behave like that? I'm backing up my media library, and sometimes only one song's tag changes. rsync then has to checksum all the files and, as mentioned in the other article, read all 80 gigs of my library to do so. On a USB drive, through SSH. Yeah. Any ideas?

Comment by Michael Wednesday night, September 1st, 2010