Ideas: rsync-like tool parallelizing transfer of large files

i use rsync heavily for backups, batch jobs and one-off tasks. every day it moves terabytes of data for me. while it’s great – it saves time and bandwidth – it could make even better use of modern hardware: multiple CPU cores and, often, plenty of unused storage IOPS:

  1. rsync uses a single CPU core on both the sender and the receiver to calculate the rolling checksum and to send only the data that has actually changed. when the file already exists at the destination and differs only slightly from the source, the transfer is often limited by single-core CPU performance rather than by the network or storage. on a 10 Gbit/s LAN rsync can slow down to a few MB/s crawl when replicating a VM image or a database dump,
  2. similarly, when handling a large transfer with millions of files, rsync works through them sequentially and pushes the data over a single TCP stream; even with BBR enabled on both sides it is often far from using the available network bandwidth. this one can be worked around with a wrapper script that runs several rsync processes in parallel – see the sketch right after this list.
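
as an illustration of that wrapper-script workaround – a minimal sketch in python that splits the job by top-level subdirectory and runs a few rsync processes at once. the paths, the job count and the per-directory split are my assumptions, not a finished tool:

    #!/usr/bin/env python3
    # rough sketch of the wrapper-script workaround: run one rsync process per
    # top-level subdirectory, a few of them at a time. SRC, DST and JOBS are
    # made-up placeholders; files sitting directly in SRC would need an extra pass.
    import subprocess
    import sys
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    SRC = Path("/data/src")    # hypothetical source tree
    DST = "backup:/data/dst/"  # hypothetical destination (host:path)
    JOBS = 8                   # how many rsync processes to run at once

    def sync_one(subdir: str) -> int:
        # each worker copies one subdirectory; rsync recreates it under DST
        cmd = ["rsync", "-a", subdir, DST]
        return subprocess.run(cmd, cwd=SRC).returncode

    if __name__ == "__main__":
        subdirs = [p.name for p in SRC.iterdir() if p.is_dir()]
        with ThreadPoolExecutor(max_workers=JOBS) as pool:
            codes = list(pool.map(sync_one, subdirs))
        sys.exit(max(codes, default=0))

splitting by directory is the simplest possible partitioning; very uneven trees would need a smarter split or a shared work queue to balance the load.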

one day, when i have some spare time, i’d love to create a tool that addresses at least the 1st pain point:

  • break up a large file into multiple chunks,
  • start a few parallel processes, each handling a subset of the chunks,
  • at the destination – work on the destination file in parallel [ linux allows multiple processes to write to the same file at different offsets ] – see the first sketch after this list,
  • use an algorithm similar to rsync’s rolling hash to send over the network only the data that is actually missing at the destination – see the second sketch,
  • while we are at it – use zstd to efficiently compress the traffic,
  • at the end – calculate checksums of both source and destination to make sure nothing got messed up; or even run an actual rsync -c – for peace of mind
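
to illustrate the chunking and parallel-write bullets above – a minimal sketch of a local chunked copy in python, using pread/pwrite so several threads can work on the same destination file at different offsets. the file names, chunk size and worker count are placeholders, and a real tool would ship the chunk data over the network rather than read it locally:

    # minimal sketch of the chunked, parallel-writers idea for a local copy:
    # split the file into fixed-size chunks and let a few threads copy their
    # chunks with pread/pwrite at the matching offsets.
    import os
    from concurrent.futures import ThreadPoolExecutor

    SRC = "vm.img"               # hypothetical source file
    DST = "vm.img.copy"          # hypothetical destination file
    CHUNK = 64 * 1024 * 1024     # 64 MiB per chunk
    WORKERS = 4

    def copy_chunk(src_fd: int, dst_fd: int, offset: int, length: int) -> None:
        done = 0
        while done < length:
            buf = os.pread(src_fd, min(4 << 20, length - done), offset + done)
            if not buf:
                break
            os.pwrite(dst_fd, buf, offset + done)
            done += len(buf)

    size = os.path.getsize(SRC)
    src_fd = os.open(SRC, os.O_RDONLY)
    dst_fd = os.open(DST, os.O_RDWR | os.O_CREAT, 0o644)
    os.ftruncate(dst_fd, size)   # pre-size the destination once
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        futures = [pool.submit(copy_chunk, src_fd, dst_fd, off, min(CHUNK, size - off))
                   for off in range(0, size, CHUNK)]
        for f in futures:
            f.result()           # surface any worker error
    os.close(src_fd)
    os.close(dst_fd)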
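
and for the rolling-hash, zstd and verification bullets – a rough sketch of the building blocks i have in mind: an rsync-style weak rolling checksum, zstd compression of chunk payloads via the third-party zstandard package, and a whole-file sha256 for the final check. the 16-bit a/b layout and the api choices are my assumptions, not rsync’s actual wire format:

    # building blocks for the delta + compression + verification steps.
    # the zstandard package (pip install zstandard) and the 16-bit a/b split
    # below are assumptions; rsync's real protocol differs in the details.
    import hashlib
    import zstandard

    MOD = 1 << 16

    class RollingChecksum:
        """rsync-style weak checksum that can slide over a file one byte at a time."""
        def __init__(self, window: bytes):
            self.n = len(window)
            self.a = sum(window) % MOD
            self.b = sum((self.n - i) * byte for i, byte in enumerate(window)) % MOD

        def digest(self) -> int:
            return self.a | (self.b << 16)

        def roll(self, out_byte: int, in_byte: int) -> None:
            # advance the window by one byte in O(1) instead of rehashing it
            self.a = (self.a - out_byte + in_byte) % MOD
            self.b = (self.b - self.n * out_byte + self.a) % MOD

    def compress_chunk(data: bytes) -> bytes:
        # compress a literal chunk before it goes onto the wire
        return zstandard.ZstdCompressor(level=3).compress(data)

    def file_sha256(path: str) -> str:
        # final end-to-end verification of source vs destination
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

the idea: the receiver publishes digests of the chunks it already has, the sender rolls the checksum over the source and ships only the chunks whose digests don’t match, and comparing the sha256 of both files at the end (or simply running rsync -c) gives the final peace of mind.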
