Steve Smith
Duration: 58 min
Views: 390
3 likes
Published: November 28, 2022

Transcript

[00:00:06] Good morning everyone. Stop. Stop. No. No, we don't applaud that. No.
[00:00:13] The tech went wrong. You don't applaud tech going wrong. You sigh and go, ugh. Can I please have a Gallic shrug from everyone French in the audience? A proper one, like, 'Ah.' Quite a few people. Thank you. Anyone else here British? One. Mmm. Suspicious. You're probably a Belgian, not British. No one is dumb enough to come to Europe anymore from Britain. Um, so yes, hello everyone, good morning. Uh, thank you to Sebastien, Yannick and everyone from Flowcon for inviting me here. My name is Steve Smith. Uh, let's get a few things out of the way. Uh, oui, c'est vrai. I am a roast beef. I am sorry for the Brexit. Uh, we have gone crazy. Um, I'd like, uh, the one British person in the audience to keep track of the live news and wave if our Prime Minister changes again in the next hour. A couple of days ago, my 10-year-old said to me, 'Who do you think our next Prime Minister will be next week?' And I said, 'Oh, when I was a boy, we didn't ask that question,' but now that's a very, uh, good question. Alright, so I'm hoping for a really positive reception. Here in France, the sort of positive reception every British person receives when they go under the Channel. Uh, my sense of humor is very British, very dry. It's possible that I don't mean anything that I say today. So I'm going to talk about, uh, you build it, you run it. Sounds great. But it won't work here. Because we're special, Steve. Our company's different, Steve. We have all kinds of problems, Steve, that we don't want to try and fix because they would be hard to fix. Okay, so. I am the head of the scale service at Equal Experts. Um, we're a global network of expert technology consultants. We have a Europe office. So if you're an independent contractor, uh, we actually have a kind of a similar model to Showdo, I want to talk to those guys. Um, but if you're an independent contractor looking for something new, then come and chat to me afterwards. I, uh, as head of scale, spend my days talking to our customers and talking to our consultants across our network about what it means to succeed at scale. And it's very similar to the Flowcon manifesto, talking about empowered teams, talking about flow, talking about products over projects, all of that good stuff. Uh, these are some of my experiences of working at scale. Uh, if I just pick one out, let's see. Uh, the third one, going from one team and one microservice to 40 teams and 120 microservices in two and a half years. Uh, that was a very exciting two and a half years. Um, it was pre-pandemic, so whenever things got really stressful, I used to go and hide in the office bathroom for an hour and just, you know, have a... it was like a quiet room, I guess. A bit like this conference has a quiet room. Of course, now these days, uh, I don't have a quiet room at home because I work from home every day. If my wife finds me in the bathroom, she just rolls her eyes at me and says, 'Why are you in here again?'
[00:03:11] Uh, I'm on Twitter as Steve Smith Tech and on LinkedIn the same. Um, don't ask about my handles; it turns out there's a lot of Steve Smiths out there and they don't want to give up their handles. Um, I've also written a couple of books you might have read on Leanpub, Measuring Continuous Delivery and Build Quality In. Um, please buy them, my children need shoes and the UK economy is ruined.
[00:03:34] So. Uh, what are we here for? We're here to achieve a new baseline for success. Uh, the world's getting faster and messier every day.
[00:03:47] Um, we need to deliver customer outcomes faster than ever before. And the way that you do that is you have to achieve three things at once. You need higher throughput, greater reliability and a learning culture. By higher throughput, I mean weekly deployments, maybe daily. By greater reliability, I mean consistently achieving two nines of availability and a time to repair in minutes, not hours. And by a learning culture, I mean creating an environment in which people can generate insights, implement post-incident actions and share knowledge around your organization. Not just keep that knowledge trapped within a particular team.
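As a rough illustration of what those reliability numbers mean in practice, here is a minimal Python sketch that converts an availability target into permitted downtime; the arithmetic is generic and the targets shown are just examples, not prescriptions from the talk.

```python
# Illustrative arithmetic only: how much downtime an availability target allows.
WEEK_MINUTES = 7 * 24 * 60      # 10,080 minutes in a week
YEAR_HOURS = 365 * 24           # 8,760 hours in a year

def allowed_downtime(availability_pct: float) -> tuple[float, float]:
    """Return (minutes per week, hours per year) of permitted downtime."""
    unavailable = 1 - availability_pct / 100
    return WEEK_MINUTES * unavailable, YEAR_HOURS * unavailable

for target in (99.0, 99.5, 99.9, 99.99):
    per_week, per_year = allowed_downtime(target)
    print(f"{target}% -> {per_week:.0f} min/week, {per_year:.1f} h/year")
# 99.0%  (two nines)  -> ~101 min/week, 87.6 h/year
# 99.99% (four nines) -> ~1 min/week,   0.9 h/year
```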
[00:04:32] And there's a problem here, isn't there? All of you, you've all been using a central operations team to manage your foundational systems and your digital services. How many people here are doing you build it, you run it today, where a delivery team is actually on call for the thing they build? Put your hand down, you're lying. There's one person here from Spotify, he's possibly telling the truth, but the last time I spoke to Spotify, they were up to 700 teams and nobody actually knows what anyone does there anymore. Everyone just goes 'round and 'round and then they all hang out in the forest in Sweden and say how great their company is. Uh, it's okay, I don't want to work for Spotify, they weren't going to call anyway. So. The call never comes, the years go by.
[00:05:15] So, uh, by foundational system, what I mean is a self-hosted COTS application, bought off the shelf, running it yourself on-prem or in the cloud, maybe some custom integrations you've built to glue things together. And an operations team is totally fine
[00:05:35] for looking after a foundational system because you need a high standard of reliability, but it doesn't have to be deployed that often, it's not changing that often, there isn't much market demand for it to change very often. But that model doesn't work for digital services. To achieve a higher throughput, greater reliability and a learning culture, all at the same time for the long term, an operations team can't do that. And not because they're evil, not because they're incompetent. I'm willing to bet your operations team is highly skilled and a lovely bunch of people, but they're working in a bad system. And that's what Flowcon and other conferences are all about, helping us to find ways to overcome that bad system.
[00:06:16] Okay, so. We need to adopt you build it, you run it. It's an operating model in which product teams build, deploy, operate and support their own services. At least, that's how my colleague Beth Timmons and I describe it.
[00:06:33] It's got an interesting history, you might be aware of it. Um, in 2006, the CTO of Amazon coined the term in an ACM Queue interview. He didn't set out to coin 'you build it, you run it'; it was just how Amazon teams described what they were doing at that time. And you know, it's a long time ago now, 2006. About 10 Prime Ministers ago in the UK, I think. It's all a bit of a blur. Um, I don't actually know the real number, I made up 10, feels about right.
[00:07:02] In the early 2010s, we had the DevOps Cargo Cult, didn't we? Everyone said, 'You need to have a DevOps team.' You need to have a DevOps team. I'm on a DevOps team. Of course, nobody knew what DevOps meant and it was all a complete waste of time trying to understand what the word DevOps meant. But I like to think that what people were talking about was, you need to have empowered product teams that are able to build, deploy, operate and support their own digital services. And of course, in the late 2010s, we had the SRE Cargo Cult, didn't we? You need to have an SRE team. You need to have an SRE team. I used to work at Google. Well, you don't need SRE and whenever someone says, 'I used to work at Google,' they probably didn't work at Google or they just, I don't know, took out the trash or something there. Um, I'd like to think that when people talked about SRE, what they were really meaning was, you need product teams that build, deploy, operate and support their own digital services at scale. And maybe if you really had some extreme reliability needs, you might have a team do it for you because they're specialists in it.
[00:08:10] Okay, so. Uh, my colleague Beth and I have written a book about, uh, you build it, you run it. It's available on the equalexperts.com website at this really strange URL. Um, if you come and chat to me afterwards, I'm happy to share it with you. Um, I'll also put the link on Twitter and on LinkedIn. We try to offer a real deep dive into all the principles and the practices behind it and share some of the experiences that Equal Experts folks have had adopting it all over the world. I mean,
[00:08:42] It's an imperfect name for something that's really important, like we have looked for other names for it, but this is the name that most people seem to coalesce around. I, I can't think of a better name, so I think we're all stuck with it.
[00:08:55] Uh, a couple of years ago, a colleague of mine said to me, did you realize that you build it, you run it, was going to be one of the hills that you die on, Steve? And for those people here today who know me, they'll know that I have an awful lot of hills. Uh, but you build it, you run it is definitely one of my top hills. I think that it's a super important thing to adopt, I don't think you can succeed at scale without it.
[00:09:18] It's probably one of my top three hills, right up there with naming your teams after the things that they build. I visited a company recently where they had a team building a data platform. And they said, 'We're thinking of calling ourselves the data platform team.' And I was like, 'Yes.' This is what you should do. Don't name yourselves after a Greek god or a pop band or a cheese. And yes, I did work at a company once where we named our teams after cheese, you're looking at the former team leader of Team Mascarpone. Uh, that team name was out of my hands, but the data platform team was going to be called the data platform team. I went back the next day and they said, 'Good news, Steve. We've called ourselves the Backstreet Boys.' So that was, uh, quite disappointing. Okay. How do you know when to do you build it, you run it? And also crucially, how do you know when not to do it?
[00:10:13] In our playbook, Beth and I have devised an
[00:10:18] operating model selector to try and help product managers decide what the right operating model is for the thing that they're building. So let's go through it slowly because it's early and I've not been drinking yet. So on the Y-axis, we've got financial exposure on failure. That's the amount of money that you can lose in a period of time if things go bang. And we've got some relative levels of that: low, medium, high, very high. And they map onto availability targets from one and a half nines up to four nines. And, uh, a time to restore from 9 hours down to 1 minute. Uh, per week, I should say. And on our X-axis, we've got product feature demand. That's the frequency with which you need to put new features in front of your customers to satisfy market demand. And again, we've got some relative levels: low, medium, high, very high. And those map onto, uh, what was it? Monthly, fortnightly, weekly and daily deploys. And yellow is when you have a central operations team. And blue is when you have you build it, you run it.
[00:11:32] So what we're saying here is three different things. Number one, if you genuinely have a real need for four nines of availability, like really extreme reliability, uh, then use you build it, you run it, whether it's a foundational system or a digital service. If it's a digital service, if you need to achieve that new baseline of high throughput, high reliability and learning culture, do you build it, you run it, that's why you're pushing over to the right-hand side of this model. If you just need a high standard of reliability, then keep your foundational system with your operations team. One thing we really try to get across about you build it, you run it, is that it does not mean getting rid of your operations team. In the same way that continuous delivery is not about getting rid of your change management team. You build it, you run it is about freeing up your operations team to do what they're really good at, which is monitoring and caring for your foundational systems, reducing the cognitive load on them and reducing their work in progress. And not just chucking loads and loads of stuff at them over a wall of confusion.
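Here is one possible encoding of that selector as a small Python sketch; the thresholds and wording are illustrative assumptions rather than the playbook's exact boundaries.

```python
# A rough encoding of the operating model selector described above.
# The exact thresholds are illustrative assumptions, not the official model.

def select_operating_model(kind: str, availability_target: float,
                           deploys_per_month: int) -> str:
    """kind is 'digital service' or 'foundational system'."""
    if availability_target >= 99.99:
        # Extreme reliability needs: you build it, you run it either way.
        return "you build it, you run it"
    if kind == "digital service" and deploys_per_month >= 4:
        # High product feature demand: weekly or daily deploys.
        return "you build it, you run it"
    # Lower feature demand, high-but-not-extreme reliability needs.
    return "central operations team"

print(select_operating_model("foundational system", 99.9, 1))   # ops team
print(select_operating_model("digital service", 99.5, 20))      # you build it, you run it
print(select_operating_model("foundational system", 99.99, 1))  # you build it, you run it
```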
[00:12:37] Okay. What does you build it, you run it look like for deployment throughput? Here we've got digital services on top and foundational systems down below. We've got on top a product team building, testing, deploying, launching and reporting their own digital service. And by reporting, I mean some kind of automated mechanism to notify your change management team that a deployment has happened. Like a deployment pipeline is a great automated audit trail. You're probably doing something like that.
[00:13:14] And with this setup, you're able to achieve fast change approvals, fast deployments, a focus on outcomes over outputs. And you have really low knowledge synchronization costs, you don't have to hand anything over between teams, you're just handing over between individuals in your team. At worst.
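A minimal sketch of that automated reporting step, assuming a hypothetical change-management HTTP endpoint; the URL, payload fields and environment variables are invented for illustration.

```python
# Hypothetical post-deployment step: notify change management automatically.
# The CHANGE_API URL and payload fields are illustrative assumptions.
import os
from datetime import datetime, timezone

import requests

CHANGE_API = os.environ.get("CHANGE_API", "https://change.example.com/api/deployments")

def report_deployment(service: str, version: str, pipeline_url: str) -> None:
    record = {
        "service": service,
        "version": version,
        "deployed_at": datetime.now(timezone.utc).isoformat(),
        "evidence": pipeline_url,  # link back to the pipeline run: the audit trail
        "deployed_by": os.environ.get("CI_USER", "pipeline"),
    }
    response = requests.post(CHANGE_API, json=record, timeout=10)
    response.raise_for_status()

if __name__ == "__main__":
    report_deployment("checkout", "1.42.0", "https://ci.example.com/builds/1234")
```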
[00:13:32] With your foundational systems, you've still got your operations team, your delivery team exactly as before. So your delivery team will build or configure and then test your COTS, your custom integrations. They'll then hand over the wall of confusion to the change management team. And they'll be confused and then when they're not confused anymore, they'll hand over another wall of confusion to the application support team, who will also be confused, and then they'll deploy it. And for foundational systems, that is totally fine.
[00:14:05] Okay. So, what does you build it, you run it look like for incident response?
[00:14:11] Again, we've got a product team on top.
[00:14:15] monitoring and supporting their own digital services, they're doing L1 on call, that means they're constantly on call for their own thing, in hours and out of hours. There is no operations team on call.
[00:14:27] And down below, there is an operations team on call for foundational systems. If you're doing maximum ITIL, which is lovely for you, you know, you do you. Uh, you'll have an operations bridge team doing the monitoring. And then they'll hand over any problems through another wall of confusion to an application support team who are doing L2 on call. Alternatively, you might merge those into one team, you know, I can't stop you doing that if you want to, go for it. And of course, when the operations team are trying to restore service with a foundational system, if it gets really hard, really complicated, they will call upon a delivery team to do it out of hours. That gets really messy, trying to find a developer who'll answer the phone. And for the developer to help in that situation, it's known as best efforts or best endeavors, or as I prefer to call it, unpaid labor. You'd think in France that would get a laugh, we just had someone up here talking about solidarity with workers. But no, I talk about unpaid labor and everyone's like, 'Yes, unpaid labor is fine in France.' Okay, so, uh, people have clearly lied to me about France. In between the product team doing you build it, you run it and the operations team looking after your foundational systems, you've got a bunch of what I like to call operational enablers. That's your network admins, your DBAs, your incident management. Your product teams use those, rely on those, depend on those in the same way that other operations teams do. I really like those teams, they're going nowhere.
[00:16:04] So, I've spoken with a bunch of companies all over the world about you build it, you run it. And I'd like to share with you some, not all, but some of the objections that people give to me. And what's really amazing is that people give these objections to me right after they've told me how amazing the thing is. What they'll do is, they'll say, 'Steve, we love you build it, you run it in principle.'
[00:16:29] And I've learned over the years that whenever someone says to me, 'I like you, Steve, in principle,' what they're about to say is how much they detest me and everything that I stand for. So, you build it, you run it sounds great. But it won't work here. Because developers won't want to do it. Anyone heard that before, 'Developers won't want to go on call'? Of course you've heard that. It won't scale. Anyone heard that? 'You build it, you run it won't scale, Steve.' Everything's got to scale now. You're head of scale. Everything's got to scale. You build it, you run it won't work because nobody would be accountable. Anyone heard that one? Yes, everyone's heard that one, 'We must have someone accountable.' Uh, jeez, I'm getting worked up. Um, early in the talk as well to be getting cross. There'd be no incident management. Anyone heard that before, 'You can't have you build it, you run it because who would do incident management?'
[00:17:25] Uh, developers will be firefighters. Anyone heard that one, 'You can't have you build it, you run it'? Yes, yes. This is going very well now. They'd be fighting fires, not churning out features in some kind of mindless feature factory. And my personal favorite that I always love hearing: 'We can't do you build it, you run it, Steve, because we can't hire a DBA for every team.' Anyone heard that one before? Yes. Nobody asked you to hire a DBA for every team, yet you're trying to do it. Like, please stop.
[00:17:54] Alright.
[00:17:57] Number one of six: you build it, you run it sounds great, but only in principle, Steve. Because developers won't want to do it. Now. What does that sound like?
[00:18:11] It sounds like, developers are divas, Steve. They just want to code. They're paid to code only. They need to be left in their cupboard, coding. I've never spoken to them. I don't know their names. But they won't want to do it. Whenever I hear that from a head of development or a head of operations or a head of IT, what it sounds like to me is, you haven't given your developers what they need to succeed. Have you actually explained to your developers that there's urgency around this, that this is about survival, that if you don't achieve this new baseline for success, you're going to lose out to your competitors in the marketplace?
[00:18:59] Have you asked your developers what's wrong with going on call and listened to what they say? And I don't mean hearing the same way that my children hear me when I say, 'Stop eating the chocolate in the kitchen,' and then they still go and do it again. I mean hearing what they say, clarifying your understanding of it until you really deeply appreciate what their concerns are. And I'm pretty confident I know what concern number one is; we'll come to it in a moment, and it's not money. Okay.
[00:19:33] And have you committed to putting things right? There'll be some really tricky organizational changes to fix if you want to adopt you build it, you run it, and it's important to give it maximum effort and maximum transparency.
[00:19:46] I really like Sebastien sharing the conference spend; like, sunlight really is the best disinfectant, okay? It drives all kinds of good behaviors when you do max transparency.
[00:19:58] Okay. How do you give your developers what they need? Well, hopefully, there's a lot of clues out there now. There's a whole bunch of surveys that we didn't use to have about you build it, you run it. In 2022, Atlassian did a survey of 2,000 developers across four different countries, I think, showing 59% of developers are on call. And there's a strong correlation between you build it, you run it and job satisfaction, despite the extra context switching. There's also a 2022 incident.io survey that again showed a whole bunch of developers on call. And it identified the number one concern of people going on call. And it's the same concern that I hear when I speak to our customers time and time again, and it's the impact on your personal life. On call is a social sacrifice and that's something that you have to respect, okay? There's a great quote in the survey: 'Not being able to live life as usual, no drinking, no long bike rides, having to carry a computer everywhere.' How many people here have been on call and can relate to that problem? Yeah.
[00:21:11] So, I've been on call before, and I remember the stress of, like, driving on a motorway, thinking, what happens if I'm called now, there's nowhere I can pull over. Of rearranging holidays because I know I'm going to be on call on a particular weekend. Of going to a cinema with my wife and there's no 4G reception and stressing that I might get called and not know about it. Like, this is the big concern that people have. And surprise, it's the same concern your operations folks have had about going on call. So it's worth speaking with them about how they've overcome this particular problem. For me, there's a couple of things you need to do. The first one is really to acknowledge the impact that on call has on personal lives and share data. You know, again, it's about being transparent, right? You want to share the frequency of incidents, their duration and when they happen. Perhaps there's a high percentage of incidents that happen during the day rather than at night, and that would put people's minds at rest if they knew that.
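A small sketch of the kind of data sharing described above: given exported incident records, report how many incidents started in working hours and how long they typically took to restore. The record fields and sample data are assumptions about what your export might look like.

```python
# Sketch: summarise exported incident records to share with would-be on-callers.
# The record fields ("started_at", "minutes_to_restore") are assumptions.
from datetime import datetime
from statistics import mean

incidents = [
    {"started_at": "2022-10-03T14:10:00", "minutes_to_restore": 22},
    {"started_at": "2022-10-11T02:45:00", "minutes_to_restore": 65},
    {"started_at": "2022-10-19T10:05:00", "minutes_to_restore": 12},
]

def in_working_hours(ts: str, start_hour: int = 9, end_hour: int = 18) -> bool:
    t = datetime.fromisoformat(ts)
    return t.weekday() < 5 and start_hour <= t.hour < end_hour  # Mon-Fri, office hours

daytime = [i for i in incidents if in_working_hours(i["started_at"])]
print(f"{len(daytime) / len(incidents):.0%} of incidents started in working hours")
print(f"mean time to restore: {mean(i['minutes_to_restore'] for i in incidents):.0f} min")
```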
[00:22:11] You need to create time and space for on-call onboarding and on-call training. You want to put yourself in a position where, when somebody goes on call, they feel confident, like they have a wealth of organizational knowledge at their fingertips. For example, you want your front-end developers to feel confident resolving back-end problems through a runbook and vice versa. That's a really nice test to have. And you need to compensate people for being on call. And I don't mean paying them for call-outs. I mean paying them for being on standby. Pay them each night that they're on call. Pay them each evening that they're on call, pay them each, uh, weekend, each bank holiday. Like, if you don't compensate people for changing their home life to fit around work, somebody else will. I've seen it happen. It can often be the straw that breaks the camel's back and affect employee retention.
[00:23:13] There was a 2019 on-call compensation survey run out of our community group that's available online. And it's got some really good information on how much people are paid and how companies remunerate on-call. And there's a lot of variation. Most companies are doing a flat rate, but it's not straightforward. So there's no magic number here. You've got to figure out a remuneration structure that works well for your company. It would be dishonest of me to say, you know, here's a magic number.
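A deliberately trivial sketch of a flat-rate standby calculation; the rates below are placeholders, because, as above, there is no magic number.

```python
# Sketch of a flat-rate standby (not call-out) payment calculation.
# The rates are placeholders; the talk is explicit that there is no magic number.
RATES = {
    "weeknight": 25.0,      # per evening/night on standby
    "weekend_day": 50.0,    # per Saturday or Sunday
    "bank_holiday": 75.0,   # per public holiday
}

def standby_pay(weeknights: int, weekend_days: int, bank_holidays: int) -> float:
    return (weeknights * RATES["weeknight"]
            + weekend_days * RATES["weekend_day"]
            + bank_holidays * RATES["bank_holiday"])

# One week of on-call: five weeknights plus a full weekend.
print(standby_pay(weeknights=5, weekend_days=2, bank_holidays=0))  # 225.0
```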
[00:23:46] Okay. It sounds great, developers on call. But it won't work here because it won't scale. Everything's got to scale. What does this sound like? It sounds like, we've got 20 teams, Steve. How can we have 20 developers on call? How could that ever be cheaper than having one operations person on call for the 20 teams? I can outsource that one person, Steve. I can put them on St. Helena, Steve, it's where the British exiled Napoleon, it's a really shit island, it's really cheap. When I hear that, I think, you know a lot about Napoleon, and then I think, you haven't balanced financial exposure with on-call costs yet.
[00:24:36] You can think about an operating model as an insurance policy, as a multi-cost insurance policy. You know, there's a run cost and in return, uh, you know, there's an opportunity cost that you can mitigate. And really, it's about protecting your revenue, right? It's about minimizing operational costs. There's a good analogy with home insurance. The more valuable the contents of your home, the more of a premium you pay to protect the contents of your house because if something happens, you want your contents replaced quickly. And you want them to be, you know, of a similar quality.
[00:25:15] What does that mean? It means, yes, you build it, you run it will be a bit, a bit more expensive on run cost than a central operations team. But when you look at opportunity costs, when you look at revenue protection, you'll see that for digital services, it is superior to having an operations team. Hmm.
[00:25:40] Have you forecast how money flows through your teams and through your services? This isn't as hard as it sounds. There'll be some business cases lying around, there'll be some revenue estimates and cost estimates lying around. You can estimate broadly which of your services have the most money flowing through them and at which times of the day. And then you can validate that information over time with real incidents, as long as you track their financial impact. And also with chaos days.
[00:26:14] And that will help you understand that not every service is the same and of the same importance.
[00:26:22] And have you let go of this absolute nonsense that everything everywhere must always be on? You'll have a few digital services with a lot of money flowing through them in the daytime and a lot of money flowing through them in the nighttime. So always have somebody watching those.
[00:26:42] You'll have the majority of digital services, I'm willing to bet, that have a lot of money flowing through them in the daytime, but less at night-time. So have someone watching, but you don't have to have one person for every service.
[00:26:56] And you'll have a few services that don't have a lot of money flowing through them in the daytime and very little at night-time. So don't have anyone watching those. It'll be fine.
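One way to sketch that triage in code: classify each service by estimated revenue flow by day and by night, then pick its out-of-hours coverage. The service names, figures and thresholds are invented for illustration.

```python
# Sketch: decide out-of-hours coverage from estimated revenue flow per service.
# Service names and revenue-per-hour figures are invented for illustration.
services = {
    "cloud-search": {"day_rev_per_hr": 50_000, "night_rev_per_hr": 20_000},
    "painting":     {"day_rev_per_hr": 8_000,  "night_rev_per_hr": 500},
    "store-ops":    {"day_rev_per_hr": 300,    "night_rev_per_hr": 10},
}

HIGH, LOW = 10_000, 1_000  # illustrative thresholds, not from the talk

def night_coverage(day_rev: float, night_rev: float) -> str:
    if night_rev >= HIGH:
        return "dedicated on-call all night"
    if day_rev >= LOW:
        return "shared domain rota at night"
    return "no out-of-hours cover; fix it in the morning"

for name, rev in services.items():
    print(name, "->", night_coverage(rev["day_rev_per_hr"], rev["night_rev_per_hr"]))
```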
[00:27:11] Okay. How can we balance financial exposure and on-call costs? Now, remember, I did warn you about my dry sense of humor, so no one can come up to me afterwards crying that they didn't understand what I was doing. I'm going to share with you now some real experiences from different organizations around the world. But what I'll do is I'll anonymize them by changing them so each one will be like a different country, okay? Like really far away countries no one's ever heard of. And that way, I'll be safe, and I won't get sued, okay? So, here's an example from a home improvement retailer in France. It wasn't really France, but I like France. I came to Paris with my wife once, and on the Eurostar I said to her, 'Remember, darling, when we go into a shop, you have to start every conversation with bonjour. Otherwise, we won't get served,' and she said, 'You're an idiot, that won't be true of the capital city.' So we went to a boulangerie, the man came up and looked at us inquisitively. And then my wife said, 'Hello,' and after 20 minutes, she said, 'Why haven't we been served yet?' And I said, 'Shall I paint you a picture?' Okay, this particular retailer that's definitely not in France had 25 teams and wanted a 'you build it, you run it' model that minimized run costs without weakening reliability incentives for developers. Because that's what you build it, you run it is all about.
[00:28:46] You want to ensure that product managers are incentivized to prioritize operational features alongside product features. You want to ensure that developers are incentivized to build a service that can gracefully degrade on failure. Because if they make a mistake, they're the ones that are going to get woken up at 3:00 a.m.
[00:29:07] So, this is the model that I constructed to balance exposure and on-call costs. It doesn't show all the teams, but a few of them just to give you a flavor of it. On the Y-axis, we have financial exposure on failure, just like our operating model selector. And again, we've got relative exposure, low, medium, high, and they map onto different availability targets. On the X-axis, we've got the time of day, on the left it's in working hours, and on the right it's out of hours. And it's that flip from in-office hours to out-of-hours where you see a dramatic change in how you build it, you run it works. We've got five teams here.
[00:29:50] The team with the critical service is the Cloud Search team. Then, at the medium level of availability, which I think is two and a half nines for this company.
[00:30:02] There's uh, what are they? There's the Outdoors team. See, when you disguise examples, you end up thinking of the real one in your head, and I mustn't say the real one. Uh, there's an Outdoors team, a painting team, and a furniture team, and they've got four services between three teams there. And they all belong to the same product domain, which is called customer journeys.
[00:30:21] And then down at the bottom, there is the unloved store operation service, and that belongs to the store operations domain.
[00:30:31] During the daytime, all of these teams are on call for their own service. They're all set up in PagerDuty. There's one person on call at all times in office hours.
[00:30:47] So, five teams, five people on call.
[00:30:51] But out of hours that changes from bottom to top.
[00:30:55] The store operations service has a low availability target. It has a small amount of financial exposure. Out of hours, there is nobody looking after it. If there is a problem with it in the middle of the night, it stays that way until the morning when the development team start work. And when I say no one's looking at it, I mean nobody. It's been designed in such a way that the operations team cannot be given this, they cannot look after it. And that's done to protect reliability incentives for the development team even though they're not on call at night. They know they're on call during the day, and they know if there's a problem at night, they'll have to fix it when they come in in the morning.
[00:31:39] Okay, moving up to the customer journeys domain: out of hours, there's a rota between the three teams. One person from the three teams is on call each night. So, tonight, somebody from the Outdoors team, the Painting team or the Furniture team will be on call for all the services from all of those teams. These teams have a similar tech stack, they collaborate quite closely.
[00:32:05] And working with a rota based on product domains creates a natural affinity around customer outcomes. You can do other affinity groupings like technology stack or location geography, but the most effective one I've found is to use product domains.
[00:32:24] And this isn't perfect, like people are a bit nervous about someone from a different team looking after their service, but it's a sibling team, it's a team that they spend time with, and people understand the mission to achieve you build it, you run it without having, you know, crazy run costs. And up at the top, the Cloud Search team, their service can lose a lot of money at any given moment. If search goes down, trade drops because people aren't searching for things to buy, obviously. So that team always has somebody on call at all hours.
[00:32:56] So, like I said, this model isn't perfect, but it works. Five teams during the day, five people on call. At night, two people on call. This is one way and there are others to scale you build it, you run it. There's plenty more um information about this scaling model in our playbook.
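A sketch of how an out-of-hours rota grouped by product domain might be generated; the domain, team and people names are placeholders, and a real rota would also handle holidays, swaps and fairness.

```python
# Sketch: one out-of-hours rota per product domain, rotating across sibling teams.
# Domain, team and people names are placeholders.
from datetime import date, timedelta
from itertools import cycle

customer_journeys = {
    "outdoors": ["amelie", "bruno"],
    "painting": ["chloe", "dominik"],
    "furniture": ["elena", "farid"],
}

# Pool people from the sibling teams so the load is shared across the domain.
engineers = [person for team in customer_journeys.values() for person in team]
rota = cycle(engineers)

start = date(2022, 11, 28)
for offset in range(7):
    night = start + timedelta(days=offset)
    print(night.isoformat(), "->", next(rota), "(covers all customer-journeys services)")
```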
[00:33:21] Okay. It sounds great. But it won't work here because nobody would be accountable. There's got to be one person that's accountable. What does this sound like?
[00:33:37] We must have one throat to choke, Steve. We must have one person on the hook.
[00:33:46] It would be the Wild West if developers are in charge of things themselves.
[00:33:52] When I hear this, what I think to myself is, you haven't tried trusting your people to do the right thing.
[00:34:02] Have you explained to your senior leadership, to your managers, to your team leads that these governance changes are an essential part of the mission that you can't do you build it, you run it without changing how governance works? That without trying you build it, you run it, without giving it a good go, you won't be able to achieve this new baseline for success.
[00:34:24] Have you informed your senior leadership of the importance of product managers mapping customer outcomes onto operational objectives as well as functionality? This is super important because when a product manager prioritizes reliability alongside functionality, like good things happen. Teams aren't feature factories, teams actually start to think about how to build in graceful degradation. Because they're given the time and space by the product manager to actually do it.
[00:34:57] And have you described to your teams how these devolved responsibilities will actually work? Have you encouraged your tech leads to spend time with your product managers, to lend their technical know-how to them, to help them estimate financial exposure, to help them estimate feature demand?
[00:35:17] All right, how do you trust your people to do the right thing?
[00:35:25] This is a RACI model for an online auction site in France. But not really in France. Although I do like France. Uh, my family and I once went to a wine festival in the French countryside, and it was really exciting: it was the end of the wine season, they had all the trestle tables out, and everyone sat at tables. And there's a guy with a microphone, and he says, 'Who here's from France?' and loads of tables stand up and everyone was like, 'Yeah.' And then he says, 'Anyone here from Canada?' and one group stands up, like, 'Yeah, we're from Canada.'
[00:35:59] And then just as a joke, thinking nobody would say yes, he says, 'Is anyone here from Great Britain?' and my family stand up, and everyone goes, 'Boo'.
[00:36:10] That actually happened in France.
[00:36:15] So. What's a RACI model? It's about who's responsible, who's accountable, who's consulted and who's informed when you're making a particular change or solving a particular problem in your company.
[00:36:28] I'm willing to bet your company has a RACI model somewhere. Governance people love talking about RACI models. So why not use tools that people are familiar with, why not use the language that people are familiar with to talk about changes you want to make, to make that change easier for them to appreciate. So, this RACI model had one category for reliability. And it covered topics like availability, support, telemetry, and on-call budget. The company has a head of operations, a head of product, and the IT department is split entirely away from products, there's a head of delivery within IT.
[00:37:10] So what I did was split the RACI model, so there's one category for digital services and one category for foundational systems.
[00:37:19] So if we do the bottom row first, foundational systems.
[00:37:23] We didn't change this. The head of operations is still accountable for the reliability of all foundational systems in the company, one throat to choke, as British people sometimes say, which is really weird. And that head of operations delegates responsibility for reliability down to their application support team. And the head of product, the head of delivery, and the product teams, they're informed when changes have happened. I'm pretty confident everyone here is familiar with this way of working. With digital services, it's different. And it's slightly complicated because there's a head of product and a head of delivery.
[00:38:09] So if a product team is working on a new customer proposition, it's funded by the head of product, so they are accountable for reliability.
[00:38:19] If the team is uh re-platforming or transforming an existing service, then that would be funded by the head of delivery out of IT, so they would be accountable for digital service reliability.
[00:38:35] And what we're doing here is we're being really clear that it's the budget holder that's accountable for reliability, okay?
[00:38:45] In both cases, you push the responsibility for reliability down to the product teams. Okay, and then the head of operations and the application support team, they're informed when there's been a problem.
[00:39:01] This has a number of benefits. The big one is that teams carefully consider risk tolerance versus engineering effort when they're choosing an availability target.
[00:39:13] When you're on a delivery team and there's an operations team that's going to support your digital service.
[00:39:21] If you are asked, and you're probably not, what availability target would you like? Your answer is five nines of availability, because it's not you that has to support it, so why wouldn't you ask for 12 golden terminators to look after your thing while you sleep? But when you're the one that has to pay for the engineering effort, and you're the one that has to do on call, something interesting happens. People start to think about, how much downtime can we actually have with this thing? How much money would we lose? How much money do we want to pay to protect that revenue?
[00:40:00] Another big benefit here is that teams start to prioritize failure design alongside product features. And when it's okay with their product manager, they'll start to do things like back-pressure queues, caching, circuit breakers, bulkheads, all of that good stuff, because now time and space will be granted to actually work on those things.
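As an example of one of those failure-design patterns, here is a toy circuit breaker in Python; a production service would normally reach for a hardened library rather than hand-rolling this.

```python
# Toy circuit breaker: after too many consecutive failures, stop calling the
# downstream dependency for a cool-off period and degrade gracefully instead.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()          # open: skip the call, degrade gracefully
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0                  # success closes the breaker again
        return result

breaker = CircuitBreaker()
# breaker.call(lambda: query_recommendations(), fallback=lambda: [])
```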
[00:40:22] Now, full disclosure, this is really hard to do. I'm not going to pretend. It's really hard to move accountabilities between different 'heads of'. It's really hard to split an OPEX on-call line item into two different CAPEX line items for on-call.
[00:40:44] It's hard, and it's the right thing to do, okay? Just because something's hard doesn't mean you shouldn't try.

Okay, number four. It sounds great, but it won't work here because there'd be no incident management. What does that sound like? Developers would abandon incidents, Steve. They'd be woken up, they'd look at a problem and go, 'I can't fix that,' and they'd go back to bed. We'd bankrupt ourselves while we sleep, Steve. Well, that sounds to me a little bit like you haven't made incident management self-service yet.

Have you connected your incident managers with your developers? Do they appreciate their co-dependency? Incident managers can't have a quiet life without developers. Developers can't have a quiet life without incident managers. Incident managers, or the person in your company responsible for incident management compliance who thinks they aren't an incident manager but, when the auditors come around, you say that's the incident manager. They're super useful, right? They've got organizational skills, communication pathways, stakeholder management skills that are super useful and, you know, really handy to developers when you're losing an awful lot of money in a hurry and people are jumping up and down saying, 'What the hell's going on?' You want your incident managers to trust that your developers will phone them when they're in a pickle. You want your developers to trust that your incident managers will jump in and help them when they're in a pickle.

Have you mapped out how your incident management process works, as is and as intended? Just treat it like a value stream mapping, like anything else, right? Figure out the manual and the semi-automated activities you've got, all of those dirty spreadsheets you like to pretend don't exist, and just plan to automate it all. Okay? Take out all of the handoffs, all of the confusion, all of the repetitive actions.

And have you run some chaos days? You don't have to go all Netflix and build some kind of crazed automated monkey that tears down AWS regions every time you click your fingers. You can run chaos days in a test environment. You can run them in the style of exploratory testing: just find the most knowledgeable member of the team, make them the chaos agent, set some ground rules and a blast radius, and let them wreak havoc, and you'll learn an awful lot about your incident management process. You'll learn a lot about how people can work together when there's a problem. You can gradually build up confidence in integrating incident management into 'you build it, you run it'.

Okay. How can you make incident management self-service? I spent some time with a broadband telco once in France. It wasn't really France, but I do like France. I checked into my hotel yesterday and the receptionist said to me, 'Breakfast will be at 6:30 a.m.' And I said, '6:30? Is that a time now?'
No one in Britain gets up at 6:30 anymore post-pandemic, we're all in bed at 6:30. And she gave me this disgusted look and said, 'Someone's going to eat your breakfast at 6:30.' So, I'm really hungry. Uh, that actually happened yesterday morning. No, yesterday evening, excuse me.

Uh, at this broadband telco, they used to have an operations bridge team who would receive alerts for services they couldn't possibly understand. And then they had a spreadsheet of phone numbers they could call when there was a problem. And of course, the spreadsheet was out of date. It was missing some phone numbers, some phone numbers were wrong, and they would blindly panic and just phone numbers until somebody answered a phone and said, 'I'm acknowledging the incident and I'll try and fix this.' They often had incidents lasting an hour or more, and up to 15 minutes of the incident was an analyst just trying to find somebody who would answer the phone and actually work on the problem.

So, I was brought in with some other folks from Equal Experts to automate their incident management workflow, and we persuaded the company to invest in PagerDuty, which is a really nice tool. I really like PagerDuty. Um, it's pretty rare that I recommend tools, but, um, it works well. So, what happens now is there's an automated alert, it goes into PagerDuty, which automatically creates an incident in ServiceNow, and there's bidirectional sync, so changes in PagerDuty are reflected into ServiceNow and vice versa. Uh, there's an incident channel created automatically in Slack, and a developer is paged by, uh, PagerDuty. Now, that's all pretty standard, but what's really nice is the incident managers are modeled as a team in PagerDuty. So, if the developers are getting into trouble, if the incident is beyond them, they just add the incident management team into the incident. The incident managers are paged, they wake up, they can look in ServiceNow, a tool they're very familiar with, for a ticket with all the necessary information, the name of the developer, the phone number of the developer. And they can phone the developer and say, 'Hey, how can I help?' This isn't hard to implement; the trick is persuading developers and incident managers to work together and also to carve out some OPEX budget for, uh, PagerDuty itself.
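For flavour, here is a minimal sketch of the first hop in that workflow: an alert fired into PagerDuty via its Events API v2. The routing key and alert details are placeholders, and the ServiceNow and Slack integrations described above are configured in the tools themselves rather than hand-coded like this.

```python
# Sketch: fire an alert into PagerDuty (Events API v2). The routing key and
# alert details are placeholders; consult PagerDuty's docs for the full schema.
import os

import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def trigger_alert(summary: str, source: str, severity: str = "critical") -> str:
    body = {
        "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],  # per-service integration key
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    response = requests.post(EVENTS_URL, json=body, timeout=10)
    response.raise_for_status()
    return response.json().get("dedup_key", "")

if __name__ == "__main__":
    trigger_alert("checkout error rate above 5% for 10 minutes", "prometheus:checkout")
```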
Okay. It sounds great, but it won't work here because developers would be firefighters. This sounds like, developers would spend all their time on BAU, Steve, BAU, BAU, BAU, they'd never work on features, they'd never churn out new features. We need to churn out features every day, Steve, morning, noon and night. Features, features, features, features, features. This sounds to me like you're not measuring or eliminating business-as-usual maintenance work. You sound like you're frightened of the thing. So why not measure it and then get rid of it?

Have you learned to manage unplanned work? Are you tracking BAU work items in your ticketing system? It's very simple, okay? If a BAU task comes up, like fixing a defect or fixing a broken build or adding infrastructure capacity, cloud or on-prem, I don't care: if it takes less than half a day, just do it. If it's more than half a day, create a ticket, put it in Jira, give it a little label like BAU or something, and then let your product manager prioritize it with you the following morning.

Have you visualized your per-team rework rate? This is out of the Accelerate book; rework rate is the percentage of time that a team spends on unplanned maintenance work. And it's easy to calculate if you're storing tickets in Jira with a little BAU label: just count the number of tickets each month that are completed with that BAU label. And then you can graph for each team whether their amount of BAU work is going up or going down.

And have you built paved roads? This is the Netflix term for a self-service, fully automated user journey. The Spotify term is golden path. At Equal Experts, we call them, um, paved roads as well. And they're BAU killers; they eliminate so much pain from developers' lives because they make so many user journeys fault-free, frictionless. They're absolutely the right thing to do.

Okay. How can you measure and eliminate BAU work? Here's an example from a commercial television network I visited a while ago in France. It wasn't really France, but I do like France. One time I came here and I enraged a waiter by asking for a burger that was very well done. It then of course came out medium, which was exactly as I intended. The waiter was disgusted; he was almost thinking, like, you know, the roast beef's well done, the burger's blue, it should be bleeding. This is a strange, strange country.

Um, so, uh, what have we got here? We've got some graphs from this television network that totally isn't in France. In yellow, there is a legacy network platform built on-prem, a COTS system with a delivery team and an operations team looking after it together. And blue is a scheduler service built by a team doing 'you build it, you run it'. We'll look at the left graph first. This is deployment interval, the number of days between live releases. You can see a flat line for the scheduler team, that's very good. They averaged a deployment a day for two years, which is exceptional. The network platform team, their average deployment interval varied from two weeks to four weeks to six weeks. And you can see some gaps between deployments. What were the gaps caused by? Change freezes, that's right. Everybody loves change freezes, I miss the old days where we had more of them, I miss them. And the right graph, we'll go to next. This is the time to restore, two years of restore data, okay? I'm going over five minutes. You're not going to come up here, I'm big. Uh, so, the Y-axis is, uh, up to four hours. You'll see in yellow a bunch of little dots that mean there were a whole bunch of incidents within their network platform and they took, you know, hours to restore. The blue dots show that there were fewer incidents for the scheduler service and they were faster to restore. And the middle graph shows BAU as a percentage. And you'll see it's really high for the legacy network platform team. Up to 75% of their time was spent fixing stuff. Only 25% of their time was spent building new things. For the scheduler team, it was only 10% of their time spent on BAU work. And the biggest difference between these two teams was that the scheduler team was empowered to do something about BAU. They were able to make changes as they saw fit to reduce BAU work as it came in. The legacy network platform team wanted to do something about it, but they weren't able to.
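A sketch of how a per-team rework rate like the BAU percentages above could be computed, assuming tickets exported from Jira with a BAU label; the field names and sample data are invented.

```python
# Sketch: per-team rework rate from exported tickets. A ticket counts as BAU if
# it carries the "BAU" label; field names are assumptions about your export.
from collections import defaultdict

completed_tickets = [
    {"team": "scheduler", "labels": ["BAU"]},
    {"team": "scheduler", "labels": []},
    {"team": "network-platform", "labels": ["BAU"]},
    {"team": "network-platform", "labels": ["BAU"]},
    {"team": "network-platform", "labels": []},
]

totals: dict[str, int] = defaultdict(int)
bau: dict[str, int] = defaultdict(int)
for ticket in completed_tickets:
    totals[ticket["team"]] += 1
    if "BAU" in ticket["labels"]:
        bau[ticket["team"]] += 1

for team in totals:
    print(f"{team}: rework rate {bau[team] / totals[team]:.0%} this month")
```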
All right, last one. My favorite. You build it, you run it sounds great, but it won't work here because you can't hire a DBA for every team. Nobody asked you to. All right, this sounds like, DBAs are so expensive. They're so hard to find, Steve. We can't embed them onto teams, they'd have nobody to talk to, they'd get lonely. We can't have developers querying the database, the last time that happened it crashed. I don't know what my DBAs are called, one of them I think is called Katie, that's what I call her. Uh, this sounds to me like you need to get to know your DBAs better. But it also sounds like you haven't made repeatable specialist tasks self-service.

Have you understood that hiring a DBA for every team is not 'you build it, you run it'? This is an unnecessary extreme, okay? You build it, you run it is about cross-functional product teams solving problems for themselves, but you can still have your DBAs as an operational enabler, as a central team. There's nothing wrong with that.

Have you rejected this idea of embedding specialists in teams? You don't have to shove DevOps people into product teams, they wouldn't know what DevOps was anyway. You don't have to shove DBAs into teams. I know that if you add more product teams, a central DBA team will become overloaded with work and may start to burn out and have a high cognitive load. I also know that if you cram a DBA into every team, they won't feel like they're part of a tribe. Their workload will vary from huge to nothing and they'll end up being spread between multiple teams, and that's not a good thing.

And have you automated repeatable specialist tasks? What you want to do is follow the general continuous delivery principle of getting the machines to do the toil so that the humans can do the harder, higher-order stuff that machines are really bad at. So just build some deployment pipelines, build some more paved roads.

Okay, how can you make repeatable specialist tasks self-service? Here's a task mapping I helped a financial services company with in France. It wasn't really France, but yesterday at Gare du Nord, I went into a shop and said, 'Bonjour,' because I know to say 'Bonjour' first. I said, 'Bonjour, do you have any UK to Euro plug adapters? I've forgotten mine.' And the lady looked at me very sympathetically, and I thought she was going to say, 'Oui, over there.' But she didn't say that. She said, 'No, not since Brexit.' And I said, 'Je suis désolé pour le Brexit, nous sommes devenus fous.' ('I am sorry for the Brexit, we have gone mad.') And she said, 'Say sorry to yourself.' Uh, it's fine. She seemed nice.

So, here is, um, the way that I helped this financial services company to break down what their DBAs were doing and what they should be doing, okay? The as-is versus as-intended again, okay? We've got some different categories of tasks here from top to bottom.

First of all, repeatable low-value tasks. Offload these to your cloud provider, as much as your conscience will allow; don't overthink it, just do it. So in the case of this company, they had an on-prem PostgreSQL relational database that a DBA had to babysit every day. And they migrated it into AWS Aurora. So at a stroke, you don't have to do disaster recovery anymore. You don't have to do backups anymore, you don't have to do infra capacity anymore. There was resistance to doing this, people worried about the migration cost, it took quite a long time for it to happen. But compared to the cost of hiring more DBAs, compared to the cost of DBAs leaving because their workload's getting too high, it pays for itself. You just have to take that longer view on it.

Okay, next, repeatable high-value tasks: things that development teams, you know, really need done, but they keep cropping up. You can build some paved roads for this, right? You can build some self-service deployment pipelines. Uh, an easy example here is creating a database schema. That doesn't need a DBA, right? It needs a DBA's permissions, which developers shouldn't have. So build a pipeline for that and lock it down, put some monitoring on it. And after the developers run the pipeline to create a database schema, just send a little email to the DBA saying, 'Hello, a new schema has been created.'
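A sketch of what that locked-down pipeline step might do: create the schema with privileged credentials that developers never see, then notify the DBAs. The connection string, webhook and identifiers are placeholder assumptions.

```python
# Sketch of the locked-down "create a schema" pipeline step: it runs with DBA
# privileges the developers never see, and tells the DBAs what it did.
# The connection DSN and the notification webhook are placeholder assumptions.
import os
import re

import psycopg2
import requests

DBA_WEBHOOK = "https://hooks.example.com/dba-notifications"   # hypothetical endpoint

def _valid_identifier(name: str) -> bool:
    # Keep it boring: lowercase letters, digits, underscores, max 63 chars.
    return re.fullmatch(r"[a-z][a-z0-9_]{0,62}", name) is not None

def create_schema(schema: str, owner_role: str) -> None:
    if not (_valid_identifier(schema) and _valid_identifier(owner_role)):
        raise ValueError("invalid schema or role name")
    conn = psycopg2.connect(os.environ["PRIVILEGED_DB_DSN"])  # injected by the pipeline only
    try:
        with conn, conn.cursor() as cur:  # commits on success
            cur.execute(f"CREATE SCHEMA IF NOT EXISTS {schema} AUTHORIZATION {owner_role}")
    finally:
        conn.close()
    # Tell the DBAs what just happened, instead of asking them to do it.
    requests.post(DBA_WEBHOOK, json={"event": "schema_created", "schema": schema}, timeout=10)

if __name__ == "__main__":
    create_schema("loyalty", owner_role="team_loyalty_rw")
```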
And finally, the ad-hoc high-value work. This is where your DBAs really shine. This is where you want them to help developers to understand indexing performance problems, problems with live user data, the things that developers are really bad at. This can work really well, but again, it's going to involve some really tricky organizational changes. Your DBAs will freak out at this a bit. They'll freak out at not looking after the database anymore. They'll freak out at not creating schemas anymore, because that's what they've always done. But the important thing is to sit them down and help them understand that what you want is in their brain, not typey type. Okay, and get your cloud provider to do as much of the typey type as possible so you can actually access their brain.

Okay. So to conclude, all of these objections to 'you build it, you run it', they all really boil down to the same thing, right? They boil down to, 'We haven't done this before.' And I totally understand that, like, change is hard, okay? It's really important that you share the mission with people, you inject some urgency into it and tell them this is about your company surviving, about delivering customer outcomes faster than ever before. You need to pick a pilot team, like a Goldilocks team, you know, not too hot, not too cold, just right. So find a team with some dependencies, but not too many dependencies. Find an important team, but not too important. And help them become an exemplar for your other teams. Don't start out with every team trying you build it, you run it; give one team three months to see how they go. And you need to change the mindset in your company. You need to help people understand that it's okay to make mistakes. It's okay to learn from mistakes, that you don't necessarily know how you're going to get to your destination, you've just got a rough idea of what it's going to look like.

All right, so finally, you build it, you run it is great. It can work here, wherever here is. Possibly in France. You need to give your developers what they need. You need to balance financial exposure and on-call costs. You need to trust your people to do the right thing. You need to make incident management self-service. You need to measure and eliminate BAU work. And you need to make repeatable specialist tasks self-service. Thank you very much for having me, and if there's time for questions, I'm happy to take them. Thank you.