rsync remains my main tool for transferring backups or just moving data between servers. But it has some pain points: for example, rsync's checksum calculation, or the ssh connection the data is piped over, can easily saturate a single CPU core before I run out of storage I/O or network bandwidth.
Here is how to parallelize it, based on this blog post:
#!/bin/bash -e
if [ $# -ne 2 ] ; then
    echo "syntax: parallel-rsync.sh src dst"
    exit 2
fi
tmpdir=$(mktemp -d /tmp/parallel-rsync.XXXXXXXXXXX)
echo "using $tmpdir"
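# Dry run: list the files rsync would transfer ('<' = sent, '>' = received),
# keep the whole path (it may contain spaces) and deal the paths round-robin into 8 lists.
# Note: give src a trailing slash so the listed paths are relative to it, as --files-from expects.
rsync --archive --verbose --partial --progress --dry-run --itemize-changes "$1" "$2" | grep -E '^[<>]' | cut -d" " -f2- | split -n r/8 - "$tmpdir/list"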
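# Run one rsync per list, 8 at a time; --files-from limits each rsync to the paths in its own list.
ls "$tmpdir/list"* | parallel --lb -t -j 8 rsync --archive --verbose --partial --progress --files-from {} "$1" "$2"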
# Final single pass to pick up whatever the lists did not cover (directories, symlinks and so on).
rsync --archive --verbose --progress "$1" "$2"
rm -rf "$tmpdir"
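
To see what the list-building step of the script does without touching any server, here is a small sketch that feeds made-up itemized lines through the same kind of grep / cut / split pipeline. The sample lines, the `demo_dir` name and the use of 2 lists instead of 8 are my assumptions for illustration only. This variant keeps everything after the first space (`-f2-`) so paths with spaces survive, and it accepts both the `<` and `>` markers. It needs GNU split for `-n r/N`.

```shell
# Illustration only: the lines below are hand-written, not captured from a real rsync run.
demo_dir=$(mktemp -d /tmp/split-demo.XXXXXXXXXXX)

# Hypothetical --itemize-changes output: three files to send, one directory, one symlink.
sample_output='<f+++++++++ photos/2020/img 001.jpg
cd+++++++++ photos/2021/
<f.st...... notes.txt
cL+++++++++ latest -> photos/2021
<f+++++++++ db/dump.sql'

# Keep only lines for transferred files, strip the itemize column,
# and deal the remaining paths round-robin into 2 list files.
printf '%s\n' "$sample_output" | grep -E '^[<>]' | cut -d" " -f2- | split -n r/2 - "$demo_dir/list"

# Show which path landed in which list.
for f in "$demo_dir"/list*; do
    echo "== $(basename "$f")"
    cat "$f"
done
```

The first list (`listaa`) ends up with the first and third path, the second (`listab`) with the second one. The directory and the symlink are filtered out here, which is why the script still does a plain rsync run at the end. Remove `$demo_dir` by hand when you are done.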