podcast about the evolution of MySQL at booking.com

very interesting, content-heavy interview with Kristian Köhntopp about his work at booking.com: 1, 2 [ alternatively on youtube: 1, 2 ].

topics discussed:

  • scale of operations:
    • thousands of relatively small database servers,
    • using database replicas instead of caches to present simpler infrastructure to developers,
    • MySQL instances managed by a team of fewer than 20 DBA / DevOps engineers who automate away routine tasks; without that, the business would not have been able to grow more than 400x in the past decade,
    • at their scale manual interventions done via SSH are ticket-worthy incidents,
    • node provisioning, adding replicas and failovers are automated [ although the last one came relatively late ],
    • the challenge has always been to stay alive, and not die from data or traffic growth,
  • security aspects:
    • automated rotation of all db-related credentials every 30 days,
    • upgrading the database engine version by provisioning a new server, promoting it to master or slave, and phasing out its predecessor,
  • dealing with data / load growth:
    • loose / JSON-based data representation is common in the early stage; as products catch up, it’s eventually converted to a more normalized form,
    • with every ~10x growth in data, a given subsystem is commonly redesigned to handle it,
    • booking went from being PostgreSQL-based, to MySQL-only, to database-polyglot, where data is persisted in MySQL, Cassandra and Elastic,
  • self hosting vs using public cloud:
    • 6-8x cost saving when running on self-managed hardware compared to using raw storage/compute from a public cloud – like AWS’s EC2,
    • surprising limitations of Amazon’s RDS services:
      • it solves easy problems but lacks more advanced features including observability & functionalities needed at a large scale with high availability requirements,
      • lack of incentives for Amazon to provide better insight into the utilization of the provided service, which leads to over-provisioning,
      • random maintenance windows that can lead to outages of infrastructure relying on the DBaaS,
      • any non-trivial workload requires DB admins, regardless of whether it runs on-prem or in the cloud,
  • dealing with new versions of MySQL where optimizer improvements lead to unintended performance degradations/changes in query plans,
  • staffing is hard – finding competent DBAs is difficult, finding ones who can write code to automate reliable database management – much more so,
  • organizational maturity:
    • creeping compliance requirements,
    • formal tracking of dependencies, to the point that applications are blocked, on the firewall level, from talking to other systems if those systems are not formally defined as needed,
    • forcing the organization to mature by demanding documentation and pulling cables at random until everyone is comfortable with, and trusts, the failover mechanisms,
    • SLAs and outage budget dictating which automation efforts must be prioritized,
  • management:
    • need for C-level to have an understanding of the technology and its implications for the business; lack of it leads to bad decisions,
    • idea of innovation tokens that every startup has and needs to spend wisely.
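
to make the "SLAs and outage budget" point above a bit more concrete, here is a minimal sketch of how an availability target translates into allowed downtime. it is not taken from the interview: the 99.9% target, the 30-day window and the function names are assumptions chosen purely for illustration.

```python
from datetime import timedelta

# Hypothetical helpers; names, the 30-day window and the example target
# are illustrative assumptions, not figures quoted in the interview.

def outage_budget(availability: float,
                  period: timedelta = timedelta(days=30)) -> timedelta:
    """Downtime allowed within `period` for an availability target such as 0.999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return period * (1.0 - availability)


def remaining_budget(availability: float,
                     downtime_so_far: timedelta,
                     period: timedelta = timedelta(days=30)) -> timedelta:
    """Budget left after subtracting observed downtime; negative means overspent."""
    return outage_budget(availability, period) - downtime_so_far


if __name__ == "__main__":
    budget = outage_budget(0.999)
    left = remaining_budget(0.999, timedelta(minutes=30))
    print("allowed downtime per 30 days:", budget)
    print("left after a 30-minute outage:", left)
```

under these assumptions a 99.9% target leaves roughly 43 minutes of downtime per 30 days, so a single half-hour manual failover would consume most of it; that is the kind of arithmetic that pushes failover automation up the priority list.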

severalnines – host of the interview – provides database management products and consulting services, similar to Percona; they compete with database offerings from cloud behemoths like AWS, GCP and Azure.

Jean-François Gagné is another well-known name from booking.com who has shared a lot of insights about MySQL on his blog and in numerous presentations.
