at work we’re happily using ghettoVCB.sh to back up and restore VMware ESXi VMs. a few weeks ago we started to experience occasional backup failures, but only for one of the larger VMs.
in the logs produced in /tmp/ghettoVCB-2017-04-xxx.log we got:
Option --adaptertype is deprecated and hence will be ignored
Destination disk format: sparse with 2GB maximum extent size
Cloning disk '/vmfs/volumes/datastore1/serverName/server.vmdk'...
Clone: 18% done. Failed to clone disk: Failed to lock the file (16392).

and on a later run:

Option --adaptertype is deprecated and hence will be ignored
Destination disk format: sparse with 2GB maximum extent size
Cloning disk '/vmfs/volumes/datastore1/serverName/server.vmdk'...
Clone: 79% done. Failed to clone disk: Connection timed out (7208969).
after some head scratching, and after watching iostat -x 1 and ifstat -b 1 -i eth0 on the nfs server that was receiving the file transfer done by ghettoVCB, i realized this:
the receiving linux box would first accept all of the data sent over the network without flushing it to disk early; then, once RAM filled up with dirty pages, it would slow down and eventually stop accepting NFS writes while it flushed the cached data to disk. in the meantime vmware’s disk cloning operation would time out.
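an easy way to see this build-up (not something i did at the time, but it would have shortened the head scratching): watch the kernel’s dirty-page counters on the nfs server while the transfer runs. Dirty is data waiting in the page cache, Writeback is data currently being flushed.

```shell
# show how much data is sitting dirty in the page cache (values in kB);
# run it under `watch -n1` during the transfer to see it grow and then drain
grep -E '^(Dirty|Writeback):' /proc/meminfo
```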
stupid me… i had earlier been playing with /proc/sys/vm/dirty_expire_centisecs and left it at a very high 300000 (i.e. 50 minutes). i had also ‘tuned’ dirty_ratio and dirty_background_ratio in the naive hope of limiting disk IO caused by some backups that involve receiving data over NFS, compressing it and then running rdiff-backup on the bzip2’ed files.
after reverting those three parameters to their previous values, my problems are gone.
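for reference, the revert looks roughly like this. the numbers below are the usual linux defaults, not the exact values from my box, so double-check what your distro ships (e.g. with sysctl -a | grep dirty on a stock machine) before copying them:

```shell
# usual linux defaults for the vm dirty-page tunables; needs root
sysctl -w vm.dirty_expire_centisecs=3000   # dirty pages become flush-eligible after 30s
sysctl -w vm.dirty_background_ratio=10     # background writeback starts at 10% of RAM
sysctl -w vm.dirty_ratio=20                # writers get throttled at 20% of RAM
# to make the values survive a reboot, put them in /etc/sysctl.conf
# or a file under /etc/sysctl.d/
```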