Heinrich Hartmann
Duration: 54 min
Published: April 16, 2025

Transcript

[00:00:05] Thank you for coming here. I traveled and arrived yesterday, and I had this funny experience: I was coming to the hotel and thought, this is awfully familiar, I know this place from somewhere. So I checked — the last time I was in Paris was 11 years ago, when I was hanging around with some PhD students. My wife was in the group, just a group of friends visiting Paris; we actually went to see football. And it turns out it was exactly the same hotel, 11 years ago, and I didn't book this one, so it was a funny coincidence. But I immediately felt at home and knew where everything was. So that was nice.

I want to talk about reliability engineering. I very much identify as a reliability engineer; I've been doing this for 10 years. In a former life I was a mathematician, and somehow I got sucked into computer science and engineering for various reasons. But it always struck me that reliability is really unsolved. When I operated my first WordPress instances, it was: why is it always going down? Why do my customers, my users, have to tell me about it? Why can't this thing just work? I worked as a data scientist for a monitoring vendor, Circonus, for about five years, before I went to Zalando and led reliability engineering there for two and a half years.

For the menu: I will give you a little bit of context around Zalando, just so we know the environment and what we are solving. Then I'm going to present a systems angle on reliability engineering. For this conference, I'm really trying to bridge reliability engineering from the technical side to reliability engineering on the more process-, people- and meeting-oriented side — how are you actually running it in a larger organization? The tool that I find most promising here is systems theory, and we will talk about that a little more. Then we will go through the different practices — as reliability engineers, what are we doing? And finally, I have a case study for you: our largest incident of the past years, with a thing called "metapeda". We will figure out what that is.

All right. Zalando. Many of you may know it; it's a fashion platform in Europe. It's relatively new — founded in 2008 — and it has 3,000 software engineers, so the tech organization is quite sizeable. You may ask: why do you need 3,000 people for a website? That's a really good question. Part of the answer is that it's actually vertically integrated fairly deeply: we have our own bank, we have our own logistics network, on the demand side we integrate with partners, and we also have an ad platform. There's just a lot of technology that supports a large part of the value chain. It's pretty large and pretty fragmented. On the tech side, this is roughly how it looks. The picture is a service graph taken from 2019 — I really want to do a new one, but this is the image I always use. It shows, I think, 1,500 microservices at the time; now we are above 3,000 microservices, so it's gotten a little more crowded.
And you see really not a lot of structure there. You see some central points, but there's a lot of complexity involved in the Zalando systems. The workload at peak is about 200,000 RPS at the edge, and we do a couple of thousand orders per minute. Infrastructure-wise, we are the largest AWS customer in Frankfurt, I think — at least there we're the largest — with over 10,000 EC2 nodes that we rent at peak. And again, we have around 3,000 software engineers in around 300 teams.

So when I was working as a reliability engineer at Zalando, the quest I always had was: how are we making this reliable? Or why is it reliable? It seems so complex, so confusing — how on Earth would you actually make this reliable? If you look at it from a purely technological perspective, it just seems hopeless: there are so many things going on, so many containers — how are we going to get an end-to-end understanding of all of this, put it somewhere we can understand, and make it nice? The learning here is: if you're just looking at the technology, you will not solve it. You have to look at people and organizational structure together with the technology, and think about it more holistically. We can already see this in the fundamentals of running software companies — Conway's law, where your technological structure mirrors your team structures — and you also have it in the DevOps mantra, "you build it, you run it". You cannot think technology without people. And the sad news for me as a reliability engineer is that the larger the company gets, the more complex and problematic the people side becomes. I come from a math point of view, I've done a lot of software engineering, and I'm really happy if I can just tinker with software, measure things very precisely and write a nice compiler. But if I want to solve reliability for Zalando and make Zalando reliable, that's the wrong place to look. We are looking at meetings, risk management, community-of-practice work — that is what drives reliability at large companies, and already at medium-sized companies these knowledge-sharing practices, like incident management and playbooks, become more and more important in order to effectively drive reliability.
[00:06:20] So here's a high-level model that I use all the time to reason about these trade-offs and about which layer we are operating on. This is basically a three-layer socio-technical system model, if you like. At the top there's the management chain, which gives structure and organization to everything. Underneath we have engineering, with the mapping from engineering teams to the dots, which are applications — this is where the technical stuff meets the human stuff. One thing I didn't mention on the slide before: I think it's very important, or at least very good, if you have a mapping from each application or piece of technology to a single owner. There are organizations with large applications owned by multiple teams; it makes a lot of things so much simpler if you don't do that — if every piece of technology has clear ownership — and then this pyramid works a lot better. And below that you have the platform.

It seems rather obvious — you can draw this picture — but what do you do with it? If you want to drive a certain outcome, like making software reliable, you can attack it on these three layers. You can drive a big project through management: put a KPI on it, put project managers behind it, drive it through the management chain and have the organization do it. You can put it on the engineering layer, so you facilitate a process that the teams carry out autonomously. For example, security updates: you don't run one big project to make everything secure or get the latest versions everywhere; you give the teams the ability to understand the problem and the incentives to optimize for it. And lastly, the platform: this is the case where the teams no longer have to think about it — it's just done for them. Running servers, running hardware, procuring hardware: that's no longer something the teams do, it's something the platform takes care of. The general pattern is always to push down: if something was a project you ran, try to make it a team capability; if something was a team capability, try to take it away and put it into the platform so the teams don't have to think about it. That's a useful model, at least.

Now, going one step further, when it's no longer so simple and you need somewhat better tools to understand your patterns, the best theoretical approach I know is systems theory. Here I want to call out the book by Donella Meadows, Thinking in Systems. It's quite popular — I think a lot of you will know it — but it was really eye-opening reading it for the first time, since it's a framework that allows you to reason about technological and human systems at the same time and draw funny diagrams, and everybody likes diagrams.
[00:09:18] I must admit that I have not really used systems theory successfully for advanced modeling. The thing I took away from it on the practical side is fairly basic — it's essentially about feedback loops, and I can put it into two slides. You draw diagrams between boxes that represent certain amounts and processes, and you can have interactions going between them. Let me just explain with the examples.
[00:09:53] In the reinforcing feedback loop example, you are looking at a bank account and interest payments: the more you have in the bank account, the more interest you get. That gives you a feedback loop — the interest itself increases your balance, you get more interest, and it spirals out of control, or hopefully just grows very large. On the other side, you have feedback loops that are balancing: if you are hungry, you might eat something, and then you get less hungry, so it counterbalances. The behavior you get from these two kinds of feedback loops is very different — one is exponential growth, the other is more like a pendulum or a spring in physics.
[00:10:38] Now, reinforcing feedback loops are often the cause of problems — you don't want your quantities to grow too large, because then your system may no longer be able to support them. The fundamental thing you can do is counterbalance such a positive feedback loop with a negative feedback loop on the side that controls it. There are various flavors of this — heating systems are also a good example — and if you look at larger infrastructure engineering, you'll find a lot of it in control theory and process engineering. For the balancing example: you may have the habit that the larger your bank account is, the more you spend. That actually happens to me a lot — I'm very successful at keeping my balance from growing too large, just by increasing the spend.
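To make the two loop types concrete, here is a minimal Python sketch (not from the talk, purely illustrative) that simulates a reinforcing loop (interest on a balance) and a balancing loop (a thermostat-style controller):

```python
# Toy simulation of the two feedback-loop types from systems theory.
# Not from the talk -- just an illustration of the qualitative behavior.

def reinforcing_loop(balance: float, interest_rate: float, steps: int) -> list[float]:
    """Bank account with interest: the stock feeds its own growth -> exponential."""
    history = [balance]
    for _ in range(steps):
        balance += balance * interest_rate  # more balance -> more interest -> more balance
        history.append(balance)
    return history

def balancing_loop(temperature: float, target: float, gain: float, steps: int) -> list[float]:
    """Thermostat-style controller: the gap to the target drives a correction -> converges."""
    history = [temperature]
    for _ in range(steps):
        temperature += gain * (target - temperature)  # correction shrinks as the gap closes
        history.append(temperature)
    return history

if __name__ == "__main__":
    print("reinforcing:", [round(x, 1) for x in reinforcing_loop(100.0, 0.05, 10)])
    print("balancing:  ", [round(x, 1) for x in balancing_loop(15.0, 21.0, 0.3, 10)])
```

The first sequence grows without bound; the second converges toward the target — the pendulum-versus-exponential behavior described above.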
[00:11:27] All right. And at the very heart of it, reliability engineering is really about putting the right control loops in place. You're looking at system behavior in the large, and you're trying to figure out: how do I control this system so it stabilizes and successfully supports production workloads?
[00:11:50] This is a somewhat fuzzy picture, but I think there's so much truth to it. If you work as a software developer, you have these different kinds of feedback loops that help you understand and control your system's behavior. Starting with the first one, the linter: as you type the code, you see the red underlines — direct feedback, very, very fast. Then you're maybe running a compiler that gives you feedback, then tests, and when it passes through CI you already have a high confidence level that it may actually work. The next thing is the production workload, which I would call the DevOps control loop — although the whole thing can probably be viewed as a DevOps control loop. You productionize it, you deploy it, you monitor it; if it fails, you want to get alerted, and then you start debugging, which is something the developer does again. And lastly, the customer: with my first WordPress blog I didn't have any of that — my customers told me when it broke, and then I restarted MySQL. The mantra here is: speed is reliability. You want all these feedback loops to be fast and effective, without too many delays, and ideally you want to work inwards — you don't want your customer to tell you that something is broken, you want your monitoring systems or your tests to tell you.

Scaling one step out, this is the current Zalando workflow for reliability at the larger, company level. We will visit some of these things later on. This was made by someone who doesn't actually do systems theory — he just arrived at it — so naturally, if you map out the processes, you see these more complex feedback loops emerge.

I started talking about this in early 2024, and it was great for me to see that Google is actually on the same train. They published a big article in ;login: in December 2024, with Ben Treynor Sloss and Tim Falsona. Ben Treynor Sloss is the guy who invented SRE and introduced it in 2003, when Google was for the first time automating their data center operations, and Tim Falsona has, I think, been at Google for an equally long time, driving SRE. They adopted systems engineering and safety engineering practices — safety engineering from the more physical side of engineering — for their operations. They cite a methodology called STAMP, and there's a handbook for it; there are so many acronyms and they never quite match, but it's basically the same thing. Here I just put a quote from the original Google article that I'm going to read to you: "Safety is an emergent property that can only be analyzed at the system level, rather than as an attribute of individual system components. Accidents are complex interactions between system components, including human operators and software." So there's never a single root cause for anything; it's always a complex interaction. "The STAMP methodology offers a robust framework for understanding and mitigating risks in complex socio-technical systems."
So it tells you: you have to look at the whole system, there are humans and operators in the loop, and essentially systems engineering — systems thinking — is the framework that allows you to make progress. So, let's look at some of our reliability engineering practices from this angle and see what they look like and how they fit in. The first thing is alerting. You're all familiar with alerting — why are we doing this? We want to reduce our time to detect issues.
[00:15:54] Alerting sits at a critical place between an automated control loop and the human control loop; it's the place where you switch domains, where you involve a human. In this diagram, I pictured a system under normal operation that may have a fault occurring of some sort; it then switches into another state, which I just called faulty operation. In the ideal case, your applications take care of the problem by themselves. You may have infrastructure components that just restart your service — this automated control loop is sometimes called the self-healing property of, for example, Kubernetes or systemd — and I would also put automated error handling into that bucket: you have an exception, you're able to handle it and stabilize the system on your own. Alerting is really a last resort: the system can no longer figure it out, we have to cross into the human domain and get an operator in, and hopefully they can mitigate it so we are back to normal operation.
[00:16:55] It also has a place in the DevOps loop, centrally, when the developer gets involved again. Here's how alerting looks at Zalando. The thing I want to point out is that a good alert description should tell you about the impact — here it's that a certain stream had a certain error rate, so we know certain users are already impacted — and ideally it should tell you what to do. In this case we have 12 or 15 playbooks that actually match. So when setting up this alert, people made it more than a signal that a server is unhappy: it's a signal that very specific user interactions are failing right now, here are 5 to 10 things you can try to fix it, here's what we know about this problem, and here are pointers on how to debug it. It tries to jump-start you on the journey and make this feedback loop faster.
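As a rough illustration of the idea that an alert should carry impact and next steps rather than just a signal, here is a hypothetical sketch in Python — this is not Zalando's actual alerting tooling, and all names and URLs are made up:

```python
# Hypothetical sketch: an alert definition that carries impact and playbook pointers,
# so the responder is jump-started instead of receiving a bare "server unhappy" signal.
from dataclasses import dataclass, field

@dataclass
class AlertDefinition:
    name: str
    condition: str                 # e.g. a query against the monitoring system
    impact: str                    # which user interaction is failing, not which host is sad
    playbooks: list[str] = field(default_factory=list)   # links to mitigation steps
    known_causes: list[str] = field(default_factory=list)

    def render_notification(self, observed_value: float) -> str:
        lines = [
            f"ALERT: {self.name} (observed: {observed_value})",
            f"Impact: {self.impact}",
            "Playbooks:",
            *[f"  - {url}" for url in self.playbooks],
            "Known causes: " + ", ".join(self.known_causes),
        ]
        return "\n".join(lines)

checkout_errors = AlertDefinition(
    name="checkout-stream-error-rate",
    condition="error_rate{stream='checkout'} > 0.05 for 5m",
    impact="Customers are currently failing to complete checkout in some sessions.",
    playbooks=["https://playbooks.example/checkout/error-rate"],   # hypothetical URL
    known_causes=["payment provider timeouts", "bad deployment of checkout service"],
)

print(checkout_errors.render_notification(observed_value=0.08))
```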
[00:17:57] The incident process.
[00:18:07] This is a machine generated one.
[00:18:15] Many people look at the incident process as something relatively boring: okay, you write up these reports once something happened, you're done with it, and you can't wait to get back to your other work. I look at postmortems as a gold mine for reliability engineering. If you really think about how to make Zalando more reliable — what's the best start? Say you're new, you're onboarding: how would you actually make progress? Reading the postmortems from the last 12 months is a great start. They are like laser beams pointing at the most fragile parts of your system, where you already know that things broke and where you are most vulnerable. And often they lead you to really good places to start optimizing — the weakest spots in your organization.
[00:19:12] When people tell me they have complex reliability problems in certain domains, this is the first question I ask: do you have a postmortem process, an incident process? How effective is it? What is the culture around it? Are people really learning from it? You want to establish this high-level feedback loop first, before you start talking about whether you have tracing or other technical things. Are you talking about this with the right people at the right cadence? That is really, really important. So how are we doing this? For the incident handling itself — the incident theater — we have a chatbot that works with Google Chat, which is what we use. I could talk about this longer, but essentially it's relatively well-known technology. And then we have a postmortem template. There's some automation around it — it gets created automatically — but nothing too surprising here; we basically took Google's template and reused it.
[00:20:16] From the systems view, my perspective on the incident process is that it's a second-order control loop. You have the DevOps control loop that the developer operates at the heart of, and once that fails to keep the system stable, the second-order control loop comes in. The outcome we want out of the postmortem is improvements — improvements to the system, but also improvements to the process. Sometimes we learn that our testing practice was not good enough, or that our policies around deployment are not good enough. So we may actually fix the loops, and not just the systems, as part of the postmortem process.
[00:20:57] Next, the weekly operational review meeting. That's maybe something you've heard of, but I know many companies who are not practicing this, and Zalando has only recently started to really embrace it as a means to drive reliability. The basic principle behind it is a management mantra: you get what you inspect. So if we want reliable systems, how about we start inspecting operational practices and reliability? The key tool that allows us to do this is a report we call the reliability report. I would say it's the most impactful 500 lines of Python my domain ever wrote: a Python script that just pulls data from everywhere and creates a Google document. It's very important that it's a Google Doc. Why? At Zalando we run all meetings with a Google Doc: we have silent reading at the beginning and then we talk about it, and we use the comment feature a lot. We didn't want to build a report that is just read by somebody or sent to somebody's email inbox — we wanted to facilitate a meeting. We wanted to make it very, very easy for all kinds of organizations to set up a weekly review meeting where they read a document together, and to get that established, feeding into this meeting format was extremely important. It's malleable: everybody can edit it, and if there are data problems or missing data, you just overwrite it. It's also a cascade: we look at this at the department level, sometimes the team level, but it goes up to the business units and to the company-global level. Every manager is asked to summarize their document and contribute those sections upwards. If there are errors in their report, they can also correct them in the next larger document, so you can fix things this way if the automation is not perfect.
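To give a feel for what a "pull data from everywhere and emit a document" script can look like, here is a toy sketch of the idea — group operational data by domain and render a document that owners fill in and comment on. The data sources, field names, and the omitted Google Docs upload are assumptions, not Zalando's actual code:

```python
# Toy sketch of a weekly reliability report generator: pull data, group by domain,
# emit a document that owners can comment on. Field names and sources are made up.
from collections import defaultdict
from datetime import date

def fetch_incidents() -> list[dict]:
    # Stand-in for calls to the incident tracker / monitoring APIs.
    return [
        {"domain": "Cloud Infrastructure", "title": "DNS resolution errors",
         "detected": "2025-03-11 09:14", "impact": "20 employees affected", "take": ""},
        {"domain": "Checkout", "title": "Elevated checkout error rate",
         "detected": "2025-03-12 18:02", "impact": "GMV impact under review", "take": ""},
    ]

def build_report(incidents: list[dict]) -> str:
    by_domain: dict[str, list[dict]] = defaultdict(list)
    for inc in incidents:
        by_domain[inc["domain"]].append(inc)

    lines = [f"Weekly Reliability Report -- week of {date.today().isoformat()}", ""]
    for domain, rows in sorted(by_domain.items()):
        lines.append(f"{domain}")
        for inc in rows:
            lines.append(f"- {inc['title']} (detected {inc['detected']})")
            lines.append(f"  Impact: {inc['impact']}")
            lines.append(f"  Take (owner fills in: effect / cause / actions): {inc['take']}")
        lines.append("")
    return "\n".join(lines)

if __name__ == "__main__":
    # In the real setup the text would be pushed into a Google Doc so meeting
    # participants can silently read and comment on it; here we just print it.
    print(build_report(fetch_incidents()))
```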
[00:23:00] And then finally — this is actually very recent — the WOR was moved to Tuesdays about three weeks ago, at the beginning of March, in order to effectively feed the Zalando business steering meeting. So now business steering is downstream of the WOR meeting. Our VP of engineering gets a holistic understanding of the last week's operations on Tuesday morning, so he can go to the business stakeholders on Tuesday afternoon and explain to them KPIs that moved because of operational issues, potential operational issues, or other things that need attention.
[00:23:37] I think it's a great success to have operational topics bubble up to business stakeholders. It's fairly recent that this is structured so clearly.
[00:23:50] So what does the review look like? I want to show you a little more of the content of this document. The first part is the incident review: for every incident that happened — ideally that's maybe 10, sometimes it's 20, sometimes it's one — we have around 10 lines like this in the global document. The structure is always the same: at the top there is a business domain, so it's grouped into different domains, and every domain has a representative, ideally a single representative, in every meeting. The reporting logic has to adapt to the level at which you are holding the meeting and to the stakeholders in the room. In this case, this comes from our infrastructure WOR; we have a subgroup called cloud infrastructure, and the head of cloud infrastructure owns that section. Then we have the alert title, some metadata around it, the detection time, and then the impact. The impact is a little interesting — I could talk about the severities and how we define them — but note that the impact is multi-dimensional. In this case it affected 20 employees; that's the only impact dimension we were able to determine. With larger customer-facing incidents, we will probably also have GMV impact: how much revenue did we lose? And then the "take" — this is where the managerial steering comes in. Every owner is asked to concisely summarize three things: what was the impact or effect, what was the cause, and which actions are you taking. And that links to the meta level of this review: we are not discussing what went wrong, how it could happen, or details of the postmortem. We are trying to understand, in this meeting, whether the team is effective at helping themselves: have they successfully diagnosed it, are they taking sensible actions? We're basically trying to see whether the machine works, not going deeper into what exactly went wrong in this incident — that's downstream from this meeting.
[00:25:55] We have an SLO table as well, which follows a similar structure. Again you have the grouping into domains, and in this case the lines don't correspond to incidents but to business operations. Our SLOs are not anchored in applications; they are anchored in processes we want to support, processes that are ideally meaningful for our customers. The SLO logic then tells you whether it was reliable or not. If it's red, we want some comments: why is it red, and what are you doing to improve?
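A minimal sketch of the kind of logic behind such an SLO line — compare attainment against a target over the review window and flag red when it is missed; the process name, target, and numbers are illustrative, not Zalando's actual SLO system:

```python
# Illustrative SLO check: SLOs are anchored in business processes, not applications.
from dataclasses import dataclass

@dataclass
class SLO:
    process: str          # e.g. "customer can place an order"
    target: float         # e.g. 0.999 success rate over the review window
    good_events: int
    total_events: int

    @property
    def attainment(self) -> float:
        return self.good_events / self.total_events if self.total_events else 1.0

    @property
    def status(self) -> str:
        # Red lines require a comment from the owning manager in the review doc.
        return "GREEN" if self.attainment >= self.target else "RED"

order_slo = SLO(process="customer can place an order", target=0.999,
                good_events=9_985_000, total_events=10_000_000)
print(order_slo.process, f"{order_slo.attainment:.4%}", order_slo.status)   # ... RED
```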
[00:26:30] And that's kind of the pattern: it's always managerial content on this side, always grouped into domains, and if you're preparing such a meeting, you just look for your sections and fill in the comments where appropriate.
[00:26:44] Here's another report, about on-call health, that follows the same structure, and there's also this kind of line that can go red. This works well with management — you use red, amber, green and then they know what to do: you don't want to be red, and if it's red, you explain why. Here it's about on-call health: how many interruptions did individual engineers have? It's again grouped by organization — in this case the infrastructure domain, where we have about 10 on-call teams. Every on-call team has a single engineer who was on call that week. We don't put the name here, but each line effectively corresponds to a single engineer's experience, and we see on which day of the week they had how many interruptions. So one guy got 15 interruptions on Tuesday, and a total of about 16 — the other days were fine. And then you discuss it: that's not something we like. I think in this case it was likely just one incident that triggered 15 alerts in a very short period of time, so probably not something you want to spend a lot of time on. But in other cases you just see a constant load. And just having this report shown to management really helped tremendously with improving the situation of many incident responders who were invisibly trapped in a situation where they were basically constantly interrupted. Depending on the workload of the team, that may sometimes just be the best you can do, but having it constantly surfaced really helps put attention on it and eventually resolve those things. Also downstream of this meeting, from the data that is discussed, if you put it into a spreadsheet or have some automation around it, you can nicely spot patterns and then hopefully steer larger investments. This is just one of the reports we do every year: we look at all the incidents and break them down into several categories, for example which root causes are causing us the most issues. If you are asking, as a platform, which domain you should be improving, this is a good starting point. Yeah.
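A sketch of the aggregation behind such an on-call health line — count pages per weekday for the engineer who was on call and flag heavy weeks. The threshold and data are made up; this is not the actual report logic:

```python
# Sketch: aggregate pages per weekday for the engineer who was on call,
# and flag weeks with too many interruptions. Thresholds are illustrative.
from collections import Counter
from datetime import datetime

pages = [  # timestamps of alerts that paged one team's on-call engineer
    "2025-03-11 02:10", "2025-03-11 02:12", "2025-03-11 02:15", "2025-03-13 14:40",
]

def oncall_health(page_timestamps: list[str], weekly_threshold: int = 10) -> dict:
    per_day = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%a") for ts in page_timestamps
    )
    total = sum(per_day.values())
    status = "RED" if total > weekly_threshold else "GREEN"
    return {"per_day": dict(per_day), "total": total, "status": status}

print(oncall_health(pages))   # {'per_day': {'Tue': 3, 'Thu': 1}, 'total': 4, 'status': 'GREEN'}
```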
[00:29:05] Okay, I will just skip this.
[00:29:09] Risk management. That's the last method in the reliability engineering practice I want to talk a little bit about. Why are we doing risk management? There's one specific event every year that we care about a lot, which is Cyber Week. Cyber Week is a very big sales event: we run a ton of campaigns, we drive a lot of traffic, and how we do in Cyber Week is very significant for the total revenue of the year. So we want to make sure that we are reliable at this point in time, and we don't want to use the postmortem loop to get there — that would mean running 10 or 15 Cyber Weeks, failing every time, and figuring out how to do better next time. We want it to be reliable this time already. So, in addition to the reactive feedback loop we have with postmortems, we want a proactive feedback loop. The starting point is not failures, but risks that engineers identify as part of their daily work and their understanding of the system. You can think of the creation of a risk as somewhat similar to a postmortem or an incident: it's something that is flagged. Then you have a process similar to the incident process: you triage it in some way — that also happens in the incident theater — and, as in the postmortem process, you collect data about it and derive action items. Here it's similar: you vet a risk — is it really worth solving, how much impact is there, how likely is it to occur — and then you follow a process which ideally results in mitigations that stabilize the system before it goes out of control. Here's what that looks like.
[00:31:07] This is just one example of a risk that we have in our register. Here are incidents where this occurred in the past; here are possible actions we can take to mitigate it. It's very important that we have clear ownership, so we need to be able to map risks to domains — in this case a head of engineering, or actually a team, Observability, owns that risk, so they are the ones looked at for mitigating it. And as you can imagine, in the weekly operational review we have a table with the risk profile of every organization, and if there's too much red or overdue, we expect the manager to leave a note. Here it's: yeah, we want to review this by the end of the month, so expect some movement in the next days. Note also that infrastructure has already mitigated 52 risks, so they were quite successful at de-risking known things. Above that is the chart of our risk-mitigation burn-down for Cyber Week last year. The blue line is the number of open risks, and you see that around August, September the teams got really active with flagging all those risks — something we centrally asked them to do. Then for Cyber Week we drove the Cyber Week-relevant risks, which is what's pictured here, down hopefully close to zero, so that we are at least happy with the state. Not necessarily everything is fully mitigated, but we say it's sufficiently mitigated, so we're happy to go into the event. And as you can see, the numbers are pretty high, so we spend a lot of time on risk mitigation ahead of this event. How about the process?
[00:32:51] Yeah, for sure. We have a project that is responsible for preparing everything, and these conversations happen in that project. There's the commercial side, which plans the campaigns, and the technical side, which prepares the system. That is the point where those conversations are had: are we okay with doing this, and what can you as business stakeholders do to help us? But the risk process, as I described it here, is really something we look at on the engineering level. And what I didn't say: it started with Cyber Week, but we very much made it something we do throughout the year — the whole process is sustainable and something we expect the tech teams to do. And how about the overdue work, here in this table? Yeah, this is just overdue: we have SLAs essentially for all the steps, so it becomes red if you haven't looked at it within the given time frame. How are the SLAs defined?
[00:33:54] The SLAs are on the time you have for each step. We expect that within 10 days you have triaged it and brought it to a good state. Then there's a decision step in the risk process where a manager says: we want to move this to execution, to mitigation — so you set a timeline for it — or we are punting. You're allowed to pause a risk and say we are not prioritizing this for half a year or so, and then we come back, and it will all be green. What we are looking at here is: is the risk-mitigation machine working? And this is really a good point — we're not so much looking at how much risk we have in the system. The domains don't become red because they are really risky; they become red if they are not effectively triaging their risks and managing the process.
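A sketch of the kind of SLA logic described here — a risk line turns red when its current process step has not been completed within its time budget, independent of how severe the risk itself is. Step names, durations, and fields are assumptions, not the actual Zalando risk tooling:

```python
# Sketch: a risk line goes RED when its current process step breaches its SLA,
# not because the risk itself is severe. Step names and SLAs are illustrative.
from dataclasses import dataclass
from datetime import date

STEP_SLAS = {            # days allowed in each step before the line turns red
    "triage": 10,
    "decision": 20,
    "mitigation": 60,
}

@dataclass
class Risk:
    title: str
    owner: str                          # clear ownership: a team or head of engineering
    step: str                           # current process step
    entered_step_on: date
    snoozed_until: date | None = None   # "punting": explicitly deprioritized for a while

    def status(self, today: date) -> str:
        if self.snoozed_until and today < self.snoozed_until:
            return "GREEN"              # a conscious pause keeps the process healthy
        age_days = (today - self.entered_step_on).days
        return "RED" if age_days > STEP_SLAS[self.step] else "GREEN"

risk = Risk(title="Monitoring backend single point of failure", owner="Observability",
            step="triage", entered_step_on=date(2025, 3, 1))
print(risk.status(today=date(2025, 3, 20)))   # RED: triage took longer than 10 days
```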
[00:34:45] Okay, I think I have five minutes.
[00:34:50] So I can tell you the story of the metapeda incident, which also touches on some of the things we have seen.
[00:35:00] So the setting: it's Tuesday, November 17th, and this is prime Cyber Week time — it's Cyber Tuesday. Friday is the big event, but already on Tuesday we have a lot of traffic. We just went through this mitigation exercise, and we have a lot of attention on our systems at this point, just listening for any noises we can preemptively fix. But so far, so good.
[00:35:29] At 12:56, this PR gets merged by an infrastructure team. The intention of the pull request is to remove certain access privileges from an automated process in the test clusters.
[00:35:45] So despite it being Cyber Week, this is not actually a system that was deemed production-relevant, because if it goes down, nothing happens — it's a system that manages AWS infrastructure, and we don't want to make any infrastructure changes during Cyber Week anyway. So if it fails, nothing happens, and it was deemed very low-risk. Also, this change was supposed to go to test only. So, really, business as usual.
[00:36:12] Unfortunately, a typo made its way into one of the config files, and instead of metadata, the file now reads metapeda. And if you look at the file name,
[00:36:25] some of you may already get scared. This is /root/Route 53 hosted zones/HTM certificates/f.yml — that is our root template for configuring Route 53, which is DNS, in all our AWS clusters. And unfortunately, there was no metadata entry anymore.
[00:36:45] So when this PR got merged, a component of ours called AWS Lifecycle Manager read that config and could no longer find any DNS information, any DNS configs. So it decided: well, apparently you don't need DNS anymore.
[00:37:02] And it deleted the hosted zones in AWS — that is what this thing manages. Now, if you are familiar with cloud infrastructure, you may say: well, that's not so bad, right? You cannot delete a hosted zone — a DNS zone — while it still has all these DNS entries in it; surely AWS will stop you from doing this. It turned out we had already run into this problem when deleting AWS accounts, which is also a routine thing we do quite often — for some reason we have around 300 AWS accounts, so it's really frequent. And the team had already optimized for this: there was a little Lambda function that eagerly cleaned out the DNS hosted zones, as an extra process bolted on. So this operation was incredibly effective, and a mere 10 minutes later we had deleted tens of thousands of DNS entries for all our internal services, in Cyber Week, and Zalando was down. Unfortunately, all our internal monitoring tools were hosted on the same infrastructure.
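The core failure mode here is a declarative reconciler treating "key missing from the config" as "desired state is empty" and deleting everything that exists. Below is a simplified sketch of that pattern plus one possible guard against mass deletion — this is not the actual Lifecycle Manager code, and the function and key names are invented for illustration:

```python
# Simplified sketch of the failure mode: a reconciler that interprets a missing
# (or misspelled) config key as "nothing is desired" and deletes all existing zones.
# Not the actual Lifecycle Manager code -- just an illustration of the pattern.

def reconcile_hosted_zones(config: dict, existing_zones: set[str],
                           max_delete_fraction: float = 0.1) -> list[str]:
    desired = set(config.get("metadata", {}).get("hosted_zones", []))
    to_delete = existing_zones - desired

    # Guard: refuse to mass-delete when the desired state shrinks suspiciously,
    # e.g. because a typo ("metapeda") made the whole key disappear.
    if existing_zones and len(to_delete) / len(existing_zones) > max_delete_fraction:
        raise RuntimeError(
            f"Refusing to delete {len(to_delete)}/{len(existing_zones)} hosted zones; "
            "desired state is suspiciously small -- manual confirmation required."
        )
    return sorted(to_delete)

existing = {"internal.example.org", "services.example.org", "api.example.org"}
broken_config = {"metapeda": {"hosted_zones": ["internal.example.org"]}}  # the typo
try:
    reconcile_hosted_zones(broken_config, existing)
except RuntimeError as err:
    print(err)
```

Without the guard, the empty desired state silently becomes a plan to delete every zone.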
[00:38:05] But Google Chat worked. So how did we get ourselves out of this? The infrastructure team got active relatively soon — within minutes — and the alerts were firing: the order curve went down before all our alerting systems went down, so we got those alerts. You could see that this was a DNS problem, because DNS was just very obviously broken, and there were not many deployments, so you could readily identify the culprit. So that thing was rolled back — but you cannot really roll this change back. It was in a partially failed state, and the Lambda function that cleaned out the records was not reversible. So we went back to the 90s: back in the day you didn't have DNS names, at least not at your LAN parties, so what you did was just share IP addresses and put them in some tables. We had a Google Doc where we did exactly that. For some critical infrastructure components we were literally putting entries into an /etc/hosts file to get into the AWS accounts again. And then we worked through a long, long list of thousands of DNS records, together with about 200 engineers in a video chat, to get our systems back to work. We had about 10 engineers actively restoring those entries in a pretty manual way. But it was quite effective: within about three hours, by 2:45, we were back in business. Monitoring started to recover around 2:30, orders came back around 3:40, 3:45, and at 8 PM all systems were fully operational again. This was the order curve: it dropped basically to zero for a long period and then came back up.

So a lot of things went wrong here, and the most blatant is that we didn't have good feedback around it. These AWS changes were fire and forget: you created a change, and at that time the infrastructure automation didn't give you any feedback. There was no feedback that metapeda is actually illegal — there was no linter, no schema validation for that. There was no preview that checked the intention of your pull request against what would actually happen in production — a relatively standard feature for infrastructure management tools at this point in time, but we hadn't integrated it into our tooling. And then our deployment policy was ineffective.
[00:40:51] We only looked at systems that could impact Cyber Week if they went down; we didn't look at systems that could impact Cyber Week if they did something. We also only looked at code changes and deployments — hence the name deployment policies — so we looked at things where we change code, but we didn't look at configuration changes, and this was flying under the radar as a configuration change. So there are a lot of learnings here about the deployment feedback loop, but also a little bit about the risk-management feedback loop — a lot of feedback loops that were either missing or didn't properly work. With this, I want to conclude, just with a quote reminding everybody that it is really about control loops to stabilize production systems, and that's what reliability engineering is all about. Thank you.
[00:42:07] Hi, thank you for the talk. So, what would happen if the PR gets merged again now after this incident?
[00:42:14] Yeah — actually, a lot of things. Let me put that slide up again. We have validation now for the whole manifest, so this would be flagged either in your editor or in a pre-commit hook — you actually cannot check it in when you have this kind of typo. But I would say the most effective thing we built, which helps prevent a large variety of these unintentional failures, is a preview feature. How that looks: once you put up the pull request, a comment gets injected that tells you exactly what is likely to change — it's not 100%, there's a little bit of risk still — but in this case it would tell you: you are deleting something like 200 hosted zones, or 30,000 DNS entries, with that change. And the last thing, which I haven't talked about, is staging. We had that for our Kubernetes infrastructure already: we are not just deploying stuff to all our AWS accounts, we have a multi-stage process. We first deploy to some internal test clusters, then to all our test clusters and accounts, then to the first set of production accounts, and so on. So it's very likely that we would have caught this in the early staging phases. Yeah.
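A hypothetical sketch of the kind of manifest validation that can run in an editor or pre-commit hook — a schema check that rejects unknown or missing top-level keys such as a misspelled metadata. The schema, keys, and file layout are assumptions, not the actual Zalando validation:

```python
# Hypothetical pre-commit check: validate a manifest against a minimal schema
# so a typo like "metapeda" cannot be committed. Schema and keys are illustrative.
import sys
import yaml   # PyYAML

REQUIRED_KEYS = {"metadata"}
ALLOWED_KEYS = {"metadata", "hosted_zones", "certificates"}

def validate_manifest(path: str) -> list[str]:
    with open(path) as fh:
        doc = yaml.safe_load(fh) or {}
    errors = []
    unknown = set(doc) - ALLOWED_KEYS
    missing = REQUIRED_KEYS - set(doc)
    if unknown:
        errors.append(f"{path}: unknown top-level keys {sorted(unknown)} (typo?)")
    if missing:
        errors.append(f"{path}: missing required keys {sorted(missing)}")
    return errors

if __name__ == "__main__":
    problems = [err for path in sys.argv[1:] for err in validate_manifest(path)]
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)   # non-zero exit blocks the commit
```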
[00:43:46] You were talking about risk mitigation. What are the criteria to prioritize risk-mitigation actions against feature development and pushing new features?
[00:43:58] Yeah, that's a very good question. Um.
[00:44:05] I don't really have a blanket answer to this. The information we collect is the severity — how much impact do you think this is going to have? — and the probability: do you think this could occur once every five years, once every ten years? Rough estimates. But essentially we are telling managers on each level:
[00:44:29] look into this and do your job in prioritizing those things. For Cyber Week this is really seasonal — a lot of the time, engineering teams will not care so much, but if you fuck up Cyber Week and you have open risks, that will be a problem. And our very senior leadership, I think, was setting the tone here. Interestingly, the manager was not accountable for the full domain, but he was sponsoring the Cyber Week efforts, and he was quite effective at setting a high bar — let me phrase it like that — for these things, and at communicating it to his colleagues. So I think a lot of it is just awareness and attention given to this by senior management, and that will ultimately inform those trade-offs. But
I think if you want to drive this completely quantitatively, it is either impossible or way too expensive. So you have to have a good amount of human judgment in there, and I think it's probably best anchored in the reporting line.
[00:45:48] Thank you for the presentation. For the risk register, I couldn't hear: how long before Cyber Week do you register the risks?
[00:46:06] It's roughly half a year. I think in general it's a continuous practice now. At every point in time, if you discover a risk — and this is often the case with incidents, where through the incident we discover a risk that maybe has not materialized, maybe was a near miss — we now have this lever, if you like.
[00:46:29] So if something like that happens, the reflex is: okay, let's register this as a risk. That way you make sure it doesn't get forgotten — which, by the way, happens a lot with action items from postmortems. Sometimes you're like: okay, this was really bad, let's do these five things — and it's more of a wish list; you may do one and then it slips. A risk is something that will live and surface all the time. So in general it's continuous; for Cyber Week we really ask all the teams in July, August to prepare for November, and if you look at the curve, it's roughly end of August, September when they really start registering risks — and reopening a lot of risks. At this point we've been doing it for five or six years, so you see a lot of known friends that just pop up again every year, and then it's like: yeah, we should look at this again. And there's a lot of architecture change, so for many of these things — okay, this could go down, it's a bit of a generic risk — the mitigations you took last year may no longer be effective, so it's probably good to double-check.
[00:47:35] Thank you.
[00:47:42] Hi. You showed us a pretty impressive reporting document, and
[00:47:55] I was wondering — in real life, from what I've seen, I worked at Doctor a bit, and they have big teams — I've seen crappy postmortems many, many times.
[00:48:07] And I was wondering: how do you keep your people motivated, or what kind of procedure did you put in place to assess the quality of the data that ICs were producing, so that in the end you have nice and usable reporting?
[00:48:24] It's a good question. And we definitely have a large variance. When I started, we treated basically every alert as an incident, and we had hundreds of incidents every week, so there should have been hundreds of postmortem documents every week — and 95% of them were complete trash. What we then did was raise the bar significantly for what counts as an incident, and we introduced the concept of an anomaly, which is just an operational hiccup that doesn't deserve a postmortem.
[00:48:55] So right now we are really talking 10, 11 postmortems every week for 3,000 software engineers. You are not writing all that many postmortems all the time, and I think that is one thing that really helped: the ones you do write deserve attention. And then, ultimately, I think "you get what you inspect" is the most effective angle. If your management chain cares and reads those documents, they magically get better; if nobody cares, engineers will take shortcuts. There are also practices around it, like roles: every postmortem needs a reviewer, which should be a principal engineer, and every postmortem has an owner, which, depending on the severity, is either a head, a director or a VP. So you have formal roles that assist this, and a strong "should" that every postmortem gets a review. But doing fewer postmortems, caring more about them, and really being serious about expecting the learnings — I think that is a good approach. And the next thing, which I think we will see in the next year, is AI helping with that. If you just take the chat history of an incident and put it into an LLM with the right prompt, you can actually get quite far with at least the timeline, the summary, and the impact sections — and probably also action items. That's what I want to do to help engineers in the next year, so they don't have to do so much work.
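A sketch of what that could look like — feeding the incident chat log to an LLM to draft the timeline, summary, and impact sections. This is a possible future direction as described above, not an existing tool; the client library, model name, and prompt wording are all assumptions:

```python
# Sketch (not an existing tool): draft postmortem sections from an incident chat log
# with an LLM. Client library, model name, and prompt wording are assumptions.
from openai import OpenAI

PROMPT = (
    "You are drafting a blameless postmortem. From the incident chat log below, "
    "produce: (1) a timeline with timestamps, (2) a one-paragraph summary, "
    "(3) an impact section, (4) candidate action items. Mark anything uncertain."
)

def draft_postmortem(chat_log: str, model: str = "gpt-4o-mini") -> str:
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": chat_log},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("incident-chat.txt") as fh:
        print(draft_postmortem(fh.read()))
```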
[00:50:59] Hello, thank you for your presentation. It seems that you have one big business event during the year, like Black Friday.
[00:51:12] Is your leadership confident enough in this feedback loop, in your reliability system, or do you have a freeze period a couple of days before the big event to avoid and limit a big risk?
[00:51:28] Yeah, deployments are frozen throughout Cyber Week, and in ample time before — we don't want the system changing during Cyber Week. It's part of the routine. Yeah.
[00:52:05] Thank you for the great talk. I had a question. You said if something is a project, you make it a team capability to push it down. Do you have an example of that?
[00:52:19] Yeah, I think load testing is a good one — also risk mitigation, if you like. Load testing really started out as a project where we were just asking all the teams, in a project framework, to create those scenarios. Then you would have three or four centrally organized load tests that simulate the global load. And then you forget about it and do it again next year. Now we have load-testing infrastructure that all the teams have available as part of their application life cycle: they write load tests in an automated way, they manage them along with the code, and there are regular load tests throughout the year — every month you can participate in one of those, or you will be participated. I mean, there's a trajectory, but you can see how that is becoming just part of the teams' routines. And you can take it one step further when you look at Facebook: they have chaos engineering on their platform, so they will routinely take out whole DCs, and you probably don't even know as a team, because it happens so frequently — your deployment fabric and your guidelines around how you write the code somehow enforce this level of reliability, so you don't have to think about it so much.
[00:53:42] One more question.
Okay, thank you, Heinrich.