so.. i have couple of master machines and few slaves. some of slaves are offsite – 60ms away from primary site.
i have 3 sets containing:
- a master
- a slave #1 connected with master over lan
- a slave #2 connected with the master via vpn over the internet
my.cnf on master includes in [mysqld] section:
# to save some bandwidth slave_compressed_protocol=true # i tend to generate id based on two last bytes of ipv4 address server-id = 1210 # since i have data partitioned into 3 sets..: auto_increment_increment=3 auto_increment_offset=3 # 3 for server #0, 4 for server #1, 5 for server #2
my slave config:
# to save the bandwidth slave_compressed_protocol=true # to make sure slave tries to reconnect every 3 minutes rather then 24h... if network connection stalls eg because of vpn flap slave-net-timeout = 180 master-connect-retry = 180 server-id = 1220 # same as on master: auto_increment_increment=3 auto_increment_offset=3
i monitor each master-slave-slave pair in two ways:
- i continuously update one of the tables to monitor replication lag
- during every weekend i run consistency check using mk-table-checksum from maatkit
at the beginning i thought that ‘SHOW SLAVE STATUS’ on slave will give me reliable information about replication lag, but it was not the case – i saw situations when slave had 10-20 minutes to catch up, yet lag in status was 0. so.. on the master i’ve created simple cron script run every minute:
start=`date +%s` d=0 while [ $d -lt 60 ] ; do echo "update repl_test set t= NOW(); " |mysql --defaults-file=/etc/mysql/debian.cnf repl_test sleep 3 d=$(( `date +%s` - $start )) donest
while on slave i use nagios script to check if content of my repl_test.repl_test table is not too old:
#!/bin/bash if [ $# -ne 5 ] ; then echo syntax mysql_replication_lag host login pass database db allowed_lag_in_sec exit 2 fi r=`echo "select UNIX_TIMESTAMP(NOW())-UNIX_TIMESTAMP(t) from repl_test;" | mysql -h$1 -u$2 -p$3 -N $4` if [ $r -lt $5 ] ; then echo OK $r sec of lag exit 0 fi echo critical $r sec of lag exit 2
quickly i’ve run into funny scenario – local clocks of servers drifted… and i had negative replication lag, so i sync all sql servers with ntp server couple of times per day.
for weekend consistency check i use:
mk-table-checksum h=$master,u=debian-sys-maint,p=$pass h=$slave,u=debian-sys-maint,p=$pass| mk-checksum-filter
if returned errorcode is non-zero – databases’ content differ, in such a case stdout will contain list of tables/databases which content does not match.
mk-table-checksum runs pretty fast and transfers only hashes of data.. so i have no problem running checksums of multi-gigabyte tables over internet in under a minute.