mysql replication – some tips

so.. i have couple of master machines and few slaves. some of slaves are offsite – 60ms away from primary site.

i have 3 sets containing:

a master
a slave #1 connected with master over lan
a slave #2 connected with the master via vpn over the internet

my.cnf on master includes in [mysqld] section:

# to save some bandwidth
slave_compressed_protocol=true
# i tend to generate id based on two last bytes of ipv4 address
server-id               = 1210
# since i have data partitioned into 3 sets..:
auto_increment_increment=3
auto_increment_offset=3 # 3 for server #0, 4 for server #1, 5 for server #2

my slave config:

# to save the bandwidth
slave_compressed_protocol=true
# to make sure slave tries to reconnect every 3 minutes rather then 24h... if network connection stalls eg because of vpn flap
slave-net-timeout = 180
master-connect-retry = 180
server-id               = 1220
# same as on master:
auto_increment_increment=3
auto_increment_offset=3

i monitor each master-slave-slave pair in two ways:

i continuously update one of the tables to monitor replication lag
during every weekend i run consistency check using mk-table-checksum from maatkit

at the beginning i thought that ‘SHOW SLAVE STATUS’ on slave will give me reliable information about replication lag, but it was not the case – i saw situations when slave had 10-20 minutes to catch up, yet lag in status was 0. so.. on the master i’ve created simple cron script run every minute:

start=`date +%s`
d=0
while [ $d -lt 60 ] ; do
        echo "update repl_test set t= NOW(); " |mysql --defaults-file=/etc/mysql/debian.cnf repl_test
        sleep 3
        d=$(( `date +%s` - $start ))
donest

while on slave i use nagios script to check if content of my repl_test.repl_test table is not too old:

#!/bin/bash
if [ $# -ne 5 ] ; then
        echo syntax mysql_replication_lag host login pass database db allowed_lag_in_sec
        exit 2
fi
r=`echo "select UNIX_TIMESTAMP(NOW())-UNIX_TIMESTAMP(t) from repl_test;" | mysql  -h$1 -u$2 -p$3 -N $4`
if [ $r -lt $5 ] ; then
        echo OK $r sec of lag
        exit 0
fi
echo critical $r sec of lag
exit 2

quickly i’ve run into funny scenario – local clocks of servers drifted… and i had negative replication lag, so i sync all sql servers with ntp server couple of times per day.

for weekend consistency check i use:

mk-table-checksum h=$master,u=debian-sys-maint,p=$pass h=$slave,u=debian-sys-maint,p=$pass| mk-checksum-filter

if returned errorcode is non-zero – databases’ content differ, in such a case stdout will contain list of tables/databases which content does not match.
mk-table-checksum runs pretty fast and transfers only hashes of data.. so i have no problem running checksums of multi-gigabyte tables over internet in under a minute.

Leave a Reply Cancel reply