pigz --rsyncable, rdiff

i back up ~90GB of mysqldumps in total each night. the more data, the bigger the pain.

mysqldump

my original setup used unpacked output of:

mysqldump --defaults-file=/etc/mysql/debian.cnf --quick --skip-lock-tables --single-transaction  --flush-logs --hex-blob  --master-data=2  -A --skip-extended-insert

then archived with rdiff-backup. some more details here. i do know that a backup produced this way, with skip-extended-insert, is larger and takes longer to restore; more about that later.

recently i found out that gzip / pigz has a --rsyncable option, which produces a slightly larger archive that is more ‘friendly’ to the rsync algorithm [also used by rdiff-backup]. so i decided to compare the size of rdiff diffs for an uncompressed mysql dump and for one compressed with --rsyncable, to see if i can gain anything from the change.
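a rough sketch of how such a comparison could be scripted. the function names and any file paths are made up for illustration; only the mysqldump flags come from my setup above. it assumes pigz and rdiff [from librsync] are installed. the delta goes into a fresh temp directory because newer rdiff versions refuse to overwrite an existing output file.

```shell
#!/bin/sh
# sketch only: function names and output paths are hypothetical.

# dump_rsyncable OUT: dump all databases and compress with pigz --rsyncable
dump_rsyncable() {
    mysqldump --defaults-file=/etc/mysql/debian.cnf --quick --skip-lock-tables \
        --single-transaction --flush-logs --hex-blob --master-data=2 -A \
        --skip-extended-insert | pigz --rsyncable > "$1"
}

# delta_size OLD NEW: print the size in bytes of the rdiff delta from OLD to NEW
delta_size() {
    tmpdir=$(mktemp -d) || return 1
    rdiff signature "$1" "$tmpdir/sig" &&
        rdiff delta "$tmpdir/sig" "$2" "$tmpdir/delta" &&
        wc -c < "$tmpdir/delta"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

usage would be something like `delta_size yesterday.sql.gz today.sql.gz`, run once per compression variant.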

in my case using the best compression level [ -9 ] is not worth it. a 25GB file compressed with -9 came out at 10474588077B and took 5m22s on average. the same file compressed with the default options was ~20MB larger, 10494193501B, and took 5m7s. both times are averages of 3 pigz runs; the process was cpu bound most of the time.
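the trade-off in plain numbers, as a quick shell calculation. the variable names are mine; the sizes and timings are the ones quoted above.

```shell
size_best=10474588077      # bytes, pigz -9
size_default=10494193501   # bytes, pigz default level

# bytes saved by -9, and extra seconds it costs (5m22s vs 5m7s)
saved=$((size_default - size_best))
extra=$(( (5 * 60 + 22) - (5 * 60 + 7) ))

printf '%s\n' "-9 saves $saved bytes and costs $extra extra seconds"
```

that is about 19.6MB saved on a ~10.5GB archive, well under 1%, for 15 more seconds of cpu-bound work.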


information about dumps from particular dates, their sizes and sizes of diffs generated by rdiff:

      input size [bytes]                                              rdiff increment [bytes]
date         raw sql             bz2              gz    rsyncable.gz         raw sql             bz2              gz    rsyncable.gz
12    26 804 890 664   8 811 112 277  10 329 919 622  10 484 961 505
13    26 808 113 131   8 811 291 622  10 331 311 920  10 486 382 591   3 716 517 795   8 812 764 367  10 331 856 487   3 824 695 504
14    26 818 029 468   8 814 098 013  10 334 963 258  10 490 095 367   3 224 401 389   8 812 943 745  10 333 249 046   3 403 160 952
15    26 818 132 819   8 814 100 947  10 334 998 479  10 490 120 905      51 622 629   8 814 248 439  10 334 867 739      56 785 799
16    26 823 520 086   8 816 488 488  10 337 628 727  10 492 799 195   2 194 848 425   8 815 753 598  10 336 834 625   2 355 854 983
17    26 831 125 141   8 819 823 312  10 341 419 832  10 496 643 886   6 618 091 371   8 818 141 586  10 339 567 038   4 293 817 628
18    26 795 688 890   8 806 741 963  10 325 784 290  10 480 737 096   3 690 809 910   8 821 477 034  10 343 358 854   3 800 270 362
19    26 797 195 049   8 806 555 214  10 326 417 028  10 481 487 569   3 822 050 888   8 808 393 234  10 327 720 381   3 928 010 024
20    26 803 479 158   8 810 942 424  10 329 400 507  10 484 489 179   3 812 420 943   8 808 206 449  10 328 353 239   3 931 321 186
21    26 834 035 279   8 818 676 083  10 338 926 030  10 494 198 724   3 660 097 166   8 812 594 481  10 331 337 276   3 938 756 318
22    26 834 013 616   8 818 614 552  10 338 914 292  10 494 193 501      51 043 642   7 389 340 006   8 821 785 673      61 001 180

total size of the rdiff archive [latest version+increments]:

          raw sql              bz2               gz     rsyncable.gz
   57 675 940 544   95 532 500 216  112 167 867 459   40 087 890 366

so for my data, gzipping the mysqldump first with pigz --rsyncable makes the most sense from the final backup size point of view. with regular gzip or bz2 the output files differ too much between consecutive dumps, so the rsync algorithm of rdiff produces diffs nearly as large as the files themselves.
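to put the totals above side by side, here is a small calculation of the rsyncable.gz archive as a percentage of each alternative. the helper name pct is made up; the byte counts are copied from the totals table.

```shell
# pct A B: print A as a whole-number percentage of B (integer division, rounds down)
pct() { echo $(( $1 * 100 / $2 )); }

rsyncable=40087890366
echo "vs raw sql: $(pct "$rsyncable" 57675940544)%"
echo "vs bz2:     $(pct "$rsyncable" 95532500216)%"
echo "vs gz:      $(pct "$rsyncable" 112167867459)%"
```

this prints 69%, 41% and 35%, so the rsyncable archive is roughly two thirds the size of the raw sql one and about a third of the plain gz one.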

as mentioned earlier we use the skip-extended-insert option for mysqldump and i’m somewhat torn whether to keep using it or not. pros of using skip-extended-insert:

  • backups done with skip-extended-insert produce smaller rdiff diffs [ for uncompressed dumps it’s ~3.7GB instead of ~5.1GB, similarly for gzip’ed backups with the rsyncable switch ].
  • backups taken in this way are easier to grep; in our case restoring single sql row is much more common than recovering the whole database
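the grep case could look roughly like this. find_row is a hypothetical helper, and it assumes the primary key is the first column of the table. with skip-extended-insert every row sits in its own INSERT statement, so one matching line is one row.

```shell
# find_row DUMP TABLE ID: print the INSERT line(s) for one row.
# DUMP may be a plain .sql file or a gzip/pigz-compressed .gz file.
# assumption: the id is the first column in the VALUES list.
find_row() {
    case $1 in
        *.gz) gzip -dc -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | grep -F -- "INSERT INTO \`$2\` VALUES ($3,"
}
```

for example `find_row dump.sql.gz users 42` would print the single INSERT for that row, ready to be pasted back into mysql.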

cons:

  • recovery time is the biggest downside. in my tests restoring a 26GB backup taken with skip-extended-insert takes ~90 minutes; a backup taken without skip-extended-insert restores in only 22 minutes.

vmware backups taken with ghettovcb

we use ghettoVCB to take weekly snapshots of vms running under vmware esxi. backups can be large. so far i’ve been pbzip2’ing them and using rdiff to keep the current and a single previous version. i ran some tests on a randomly selected snapshot of a windows 2012 server vm and compared the sizes:

      input size                                      diff size
date             raw             bz2    rsyncable.gz             raw             bz2    rsyncable.gz
17    35 170 054 000  14 820 111 000  15 427 020 000
24    39 063 046 000  15 484 366 000  16 193 240 000   8 049 706 000  14 810 731 000  11 461 074 000

and the total size of rdiff archive:

             raw             bz2    rsyncable.gz
  47 112 752 000  30 295 097 000  27 654 314 000

so here too pigz with the --rsyncable option seems to be the winner. at some point i should take a look at xz.
