This is the next article in the series describing how we’re improving service availability at Citymobil (you can read the previous parts here: part 1 [1], part 2 [2], part 3 [3]). Further on, I’ll talk about the accidents and outages in detail.
Let me begin with a specific example of this type of outage. We deployed an optimization: we added USE INDEX to an SQL query. In testing as well as in production it sped up the short queries but slowed down the long ones, and the slowdown of the long queries was only noticed in production. As a result, a lot of long parallel queries brought the database down for an hour. We thoroughly studied the way USE INDEX works, described it in the Do’s and Don’ts file, and warned the engineers against incorrect usage. We also analyzed the query and realized that it retrieves mostly historical data and can therefore be run on a separate replica dedicated to historical requests. Even if that replica goes down due to an overload, the business keeps running.
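The article doesn’t quote the actual query, but a minimal sketch of such a hint (the table and index names below are made up for illustration) shows the mechanism: USE INDEX restricts the optimizer to the listed indexes, which can help one access pattern while hurting another.

    -- Hypothetical example: restrict the optimizer to idx_client_created.
    -- Short, selective lookups may get faster, while large range scans can get
    -- slower if a different index (or even a full scan) would actually be cheaper.
    SELECT id, status, price
    FROM orders USE INDEX (idx_client_created)
    WHERE client_id = 42
      AND created_at >= '2019-01-01';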
We kept stumbling upon similar issues afterward, and at some point we decided to address the matter systematically. We went through the code and moved every query we could, without compromising the service, to the replicas. The replicas themselves were divided by level of criticality, so that no single one of them could fail and stop the service. As a result, we came up with an architecture with the following databases:
This architecture gave us plenty of room for growth and reduced the number of crashes caused by nonoptimal SQL queries. But it’s still far from perfect. We are planning to implement sharding so that we can scale updates, deletes, and the queries that are highly sensitive to data freshness. MySQL’s margin of safety is not infinite. We’ll soon need some heavy artillery in the form of an in-memory database (e.g. Tarantool). I’ll definitely talk about it in my next articles.
While we were dealing with nonoptimal code and queries, we understood the following: any nonoptimality should be eliminated before release, not after. This decreases the risk of outages and the effort engineering teams spend on optimization. Once the code has been deployed and new releases have been built on top of it, optimizing it is much harder. As a result, we introduced a mandatory code review for optimization. It’s carried out by our most experienced engineers, our elite force.
We also started collecting the best code optimization practices suited to our realities in the Do’s and Don’ts file. They are listed below. Please don’t take these practices as undeniable truth and don’t try to replicate them blindly. Each one makes sense only for a specific situation and a specific business. They are given here just as an example to clarify the specifics:
Examples of these accidents:
To minimize outages due to these causes, we have to investigate the nature of the accident every time one occurs. We haven’t figured out a general rule yet, so again, let’s look at specific examples. At some point, surge coefficients (the multipliers applied to taxi fares at times and places of high demand) stopped working. The reason: a Python script was running on the server of the database replica from which the data for calculating the coefficients was taken; the script used up all the memory, and the replica went down. The script had been running for a while; it was running right on the replica for the sake of convenience. The problem was solved by restarting the script. The conclusions we drew: do not run foreign scripts on a database server (we wrote that down in the Do’s and Don’ts, otherwise it would have been a wasted lesson!), and monitor memory usage on database servers, alerting via SMS when a server is about to run out of memory.
It’s essential to always draw conclusions and not settle into the «saw the problem, fixed it, forgot about it» mode. Quality service can only be offered if one walks away with a conclusion. Besides that, SMS alerts are critical: they raise the quality of the service, keep it from degrading, and let us increase its reliability. Like a mountain climber who gets to a stable position and then pulls himself up to another stable position, only higher this time.
Monitoring and alerts are invisible, but they act like iron hooks cutting into the rock of the unknown, preventing us from falling below our service level agreement, which we are continually raising.
What we call «an Easter egg» is a delayed-action mine that we haven’t tripped yet, even though it has existed for a while. Outside of this article, the term refers to undocumented features created on purpose. In our case, it’s not a feature at all, but rather a bug that acts like a time bomb and appears as a side effect of some well-intentioned activity.
For example: an auto_increment field;
Another popular kind of Easter egg is nonoptimal code; to be more specific, a nonoptimal SQL query. The table used to be smaller and the workload lighter, so the query worked fine. But with the table growing linearly in time and the workload also growing linearly in time, the resource consumption of the database management system grows quadratically. Usually that leads to a drastic negative effect: everything seems to be ok, and then suddenly: oops!
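As a rough, made-up illustration of that quadratic growth: suppose a query full-scans a table, the table gains N rows each month, and the query also gets called proportionally more often as the business grows. After m months a single run reads about m·N rows and the query runs about m times more often, so the total rows scanned per day grow like m²: four times the work after two months, nine times after three, until the database suddenly can’t keep up.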
A rarer scenario is a combination of a bug and an Easter egg: a release with a bug enlarged a database table and increased the number of rows of a specific kind, while an already existing Easter egg caused a database overload due to slower queries against that expanded table.
We also used to have some Easter eggs that were not related to workload, though. For instance, a 32-bit auto_increment field in MySQL: after a little over 2 billion rows, inserts start to fail. Therefore, in the modern world we must use 64-bit auto_increment fields only. We learned that lesson well.
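As a sketch of how one might hunt for this particular kind of Easter egg (the table name in the ALTER is hypothetical), information_schema exposes the current auto_increment counter of every table, and the column can be widened before the limit is reached:

    -- List tables whose auto_increment counter is approaching the signed 32-bit limit.
    SELECT table_schema, table_name, auto_increment
    FROM information_schema.tables
    WHERE auto_increment IS NOT NULL
      AND auto_increment > 1500000000   -- already well past halfway to 2147483647
    ORDER BY auto_increment DESC;

    -- Hypothetical migration of a primary key to a 64-bit counter; on a large
    -- table this rebuilds the table, so it has to be planned carefully.
    ALTER TABLE orders
      MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;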
How do we deal with Easter eggs? The answer sounds straightforward: a) search for the old eggs, and b) don’t let new ones appear. We are trying to do both. The search for old eggs goes hand in hand with our continuous code optimization. We appointed two of our most experienced engineers to do optimization almost full-time: they find the queries in slow.log that consume the most database resources and optimize these queries and the code around them. We lower the chance of new eggs appearing by having the sensei engineers mentioned above check every commit for optimality. Their task is to point out the mistakes that affect performance, to suggest a better way to do things, and to pass this knowledge on to other engineers.
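For reference, slow.log here is MySQL’s slow query log; enabling it at runtime looks roughly like this (the one-second threshold is just an example, not our actual setting):

    -- Log statements that run longer than 1 second...
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;
    -- ...and optionally those that use no index at all.
    SET GLOBAL log_queries_not_using_indexes = 'ON';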
At some point, right after finding yet another Easter egg, we realized that looking for slow queries is good, but we should also look for queries that appear fast today yet are slow by nature. These are the next contenders for crashing everything in case of another table’s explosive growth. A silly but obvious example is a query that full-scans a 10-row table without using any indexes. It will work fast for the time being; however, once the table grows big enough, that query will take the database down. That’s the very Easter egg.
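One way to catch such queries before the table explodes (the table name is, again, invented) is to look at the execution plan instead of the current latency: EXPLAIN reports a full table scan even while the table is tiny.

    -- "type: ALL" in the plan means a full table scan, no matter how fast
    -- the query currently runs on a 10-row table.
    EXPLAIN
    SELECT *
    FROM driver_bonuses
    WHERE comment LIKE '%holiday%';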
These are the causes that we seem unable to control very well. Put another way, those are the causes that can only be mitigated but not eliminated. For example:
Since even mitigating an external cause is a long and expensive endeavor, we began collecting statistics on accidents caused by external reasons and waiting for a critical mass to accumulate. We don’t have a recipe for defining that critical mass; it’s a matter of intuition. For example, if we were completely down five times due to, let’s say, issues with the DDoS protection service, then with every subsequent downtime the need for an alternative would become more and more acute.
On the other hand, if we can somehow make everything keep working while an external service is unavailable, we definitely do that. The post-mortem analysis of every outage helps us here. There always has to be a conclusion, which means that, like it or not, we always come up with a workaround.
Author: danikin
Source [4]
Links in the text:
[1] part 1: https://habr.com/ru/company/mailru/blog/449034/
[2] part 2: https://habr.com/ru/company/mailru/blog/449310/
[3] part 3: https://habr.com/ru/company/mailru/blog/449708/
[4] Source: https://habr.com/ru/post/450678/?utm_campaign=450678