very interesting, content-heavy interview with Kristian Köhntopp about his work at booking.com: 1, 2 [ alternatively on YouTube: 1, 2 ].
topics discussed:
- scale of operations:
- thousands of relatively small database servers,
- using database replicas instead of caches to present simpler infrastructure to developers,
- MySQL instances managed by a team of fewer than 20 DBA/DevOps engineers who automate away routine tasks; without that, the business would not have been able to grow over 400x in the past decade,
- at their scale manual interventions done via SSH are ticket-worthy incidents,
- node provisioning, adding replicas, and failovers are automated [ although the last one came relatively late ],
- challenge has always been to stay alive, and not die from data or traffic growth,
- security aspects:
- automated rotation of all db-related credentials every 30 days,
- upgrading the database engine version by provisioning a new server, promoting it to master or slave, and phasing out its predecessor,
- dealing with data / load growth:
- loose / JSON-based data representation is common in the early stages; as products catch on, it's eventually converted to a more normalized form,
- with every ~10x growth in data for a given subsystem, that subsystem is commonly redesigned to handle the growth,
- booking went from being PostgreSQL-based, to MySQL-only, to database-polyglot, where data is persisted in MySQL, Cassandra, and Elastic,
- self hosting vs using public cloud:
- 6-8x cost savings when running on self-managed hardware compared to using raw storage/compute from a public cloud, such as AWS's EC2,
- surprising limitations of Amazon’s RDS services:
- it solves easy problems but lacks more advanced features, including observability and functionality needed at a large scale with high availability requirements,
- lack of incentives for Amazon to provide better insights into the utilization level of the provided service, which leads to over-provisioning,
- random maintenance windows that can lead to outages of infrastructure relying on the DBaaS,
- any non-trivial workload requires DB admins, regardless of whether it runs on-prem or in the cloud,
- dealing with new versions of MySQL where optimizer improvements lead to unintended performance degradations/changes in query plans,
- staffing is hard – finding competent DBAs is difficult, finding ones who can write code to automate reliable database management – much more so,
- organizational maturity:
- creeping compliance requirements,
- formal tracking of dependencies, to the point that applications are blocked, on the firewall level, from talking to other systems if those systems are not formally defined as needed,
- forcing the organization to mature by demanding documentation and by pulling cables at random until everyone is comfortable with, and trusts, the failover mechanisms,
- SLAs and outage budget dictating which automation efforts must be prioritized,
- management:
- need for C-level executives to understand the technology and its implications for the business; lack of it leads to bad decisions,
- idea of innovation tokens that every startup has and needs to spend wisely.
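the "SLAs and outage budget" point above can be made concrete with a little arithmetic. the sketch below is only an illustration: the SLA percentages, the 30-day period and the function names are assumptions of mine, not figures or tooling mentioned in the interview.

```python
from typing import Iterable


def outage_budget_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed per period for a given availability SLA.

    Both the SLA value and the period length are illustrative assumptions.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


def remaining_budget_minutes(
    sla_percent: float,
    outage_minutes: Iterable[float],
    period_days: int = 30,
) -> float:
    """Budget left after subtracting recorded outages (negative if overspent)."""
    return outage_budget_minutes(sla_percent, period_days) - sum(outage_minutes)


if __name__ == "__main__":
    # Hypothetical SLA tiers, just to show how quickly the budget shrinks.
    for sla in (99.0, 99.9, 99.99):
        print(f"{sla}% over 30 days: {outage_budget_minutes(sla):.1f} min")
```

a subsystem whose remaining budget is close to zero, or negative, is a natural candidate for the next round of automation work.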
Severalnines – host of the interview – provides database management products and consulting services similar to Percona; they compete with the database offerings of cloud behemoths like AWS, GCP and Azure.
Jean-François Gagné is another well-known booking.com name who has shared a lot of insights about MySQL on his blog and in numerous presentations.