Steve Janaway
Transcript
[00:00:01]
Good morning everybody. Hello. Good morning. So I'm here to talk to you today about production support. There is a bit of audience participation in this, so if you are still feeling a little bit tired and you've come in and you think, oh, I'll sit there and I'll fall asleep.
[00:00:22]
Maybe not. I'll take questions at the end. So if you've got any questions as we go through this, just store them up; there's plenty of time at the end for them.
[00:00:34]
Hi, I'm Stephen. This is my first time at Flocon, so thank you so much to the organizers for inviting me along. I had a great time yesterday. Hope you all did too. Hope you're looking forward to a really good day at the conference today as well. If you have any questions or anything you want to discuss with me after this and you don't manage to catch me at the rest of the conference, this is my Bluesky, this is my LinkedIn. So please send any questions or anything you want to talk about over to those. So, I'm the VP Engineering at a company called Bloom and Wild. We are a flower delivery company. We have three brands; our French brand is called Bergamotte. You may have... I can see someone nodding. Excellent. Somebody's heard of us. Brilliant. I've been VP Engineering there for almost eight years, and what I'm going to talk to you about today is a bit of a story about production support, and how we've handled things from when we were a very small startup to a much larger, scaled-up company.
[00:01:39]
So, when you think about support, support means different things to different people. So, to some people, it means this.
[00:01:50]
Sometimes turning it off and turning it back on again is actually a really good thing. I work in London, and I was catching the train to work a couple of weeks ago. The train pulled into a station maybe about 30 minutes from London and it stopped. Normally it sits there for about two minutes; it sat there for 10 minutes. And then the guard came over the tannoy and said, there's a problem with the train. And we sat there.
[00:02:18]
And then all the power went off. And it went very quiet. And then about a minute later, all the power came back on again, lots of things started beeping. And then we carried on on our journey to London and it was all absolutely fine. So sometimes turning things off and turning them back on again does actually work, even on something as big as a train. But I'm not here to talk to you about that sort of support today. I'm also not here to talk to you about how we generally support each other as humans and as individuals.
[00:02:50]
Support might also mean something like this, so maybe supporting a sports team. This is actually my local team winning promotion. So a place called Woking in the south of England, which I'm sure many of you have not heard of. We play five leagues below the Premiership, so we're very small. No, what I'm here to talk to you about today is production support in the context of how we develop, how we deliver and how we support software. So, think about how we design, we develop, we test, we deploy our code. We operate and we monitor it. And if we're doing it well, it's the same team operating and monitoring it as designed it, wrote it, tested it and deployed it.
[00:03:42]
And then we support it.
[00:03:44]
And it's this last bit that I'm here to talk to you about today. So, it's a story about how we supported our customers and our internal users and our software at Bloom and Wild.
[00:03:58]
Now, what I'm not here to do is to tell you how to run your software in production, whether it's a good idea, whether you should run it in production, or anything like that. A lot of other people have done far better talks at conferences about how you build it and how you run it. This one from Steve Smith, for example, is very much worth watching. No, what I'm going to tell you about is how we supported our software. A lot of teams struggle, when they support software, to move from a centralized model of support to a distributed model. We've kind of had the opposite problem, which I think's also quite interesting. So at its heart, this is a story about how a talented group of engineers here help get lovely flowers like this here to lovely customers here. Pretty simple.
[00:04:56]
And it's a bit of a story about who this guy is as well. Now, I called this doing the gardening at Bloom and Wild. And so this is going to be a story about gardening.
[00:05:10]
But it's going to be a little bit of a story about how one might also do gardening. So think of this like a gardening journey, right? So, when you start, maybe you get interested in growing flowers or growing vegetables. Maybe you start with something on your windowsill, maybe you just have a little bit of room, and maybe you grow some herbs or maybe you grow a chili plant or something like that.
[00:05:37]
So, I thought I'd take the opportunity in this presentation not only to teach you about how to support your software in production, but also to teach you a little bit about flowers too. So this is where the audience participation bit comes in, right? We're going to play a little game as we go through here of guess the flower. I've also tried to translate these flowers into French. I hope I've got the translations right. I will see from your faces if I've got them wrong, and I apologize if I have. So, let's start quite easy. What flower is this? Anybody want to be brave and tell me? That's correct. Is that the correct French translation? Excellent. Brilliant.
[00:06:18]
So coincidentally, if you like daisies, you can get them in this lovely bouquet of flowers from Bloom and Wild.
[00:06:26]
So daisies are small flowers. And Bloom and Wild started small. So we had a small Ruby on Rails monolith, a compelling proposition in a single country. This is our first website. It looks very, very different from our current one. But like any startup, that means a small team, a few engineers, really close cooperation, a really super fast feedback loop between everyone in the business.
[00:06:55]
And this is what we looked like.
[00:06:58]
So this is about the point at which I joined Bloom and Wild, about eight years ago. When you're in this sort of first phase, people really just know what's happening. Things happen almost through osmosis. There's very little policy, very little procedure. There's very little day-to-day management because there's just a North Star. Everyone knows where they're heading, everyone knows what they need to do, they're doing their thing in their own little swim lane, but getting towards that North Star. And the most important thing engineers can be doing at this stage in a startup's journey is shipping code. Learning from what they ship, course correcting and repeating. Making mistakes, recovering from those mistakes. Now, at this stage, we had a mixture of internal users on our platform, using it for CRM, for marketing, for range management, for fulfillment, for customer support. It was your classic "a monolith will do everything".
[00:07:59]
And how you support your code in production at this stage is pretty easy. When something breaks, you fix it. You've got a very small number of engineers; when a key system goes down, all of these engineers mob on the problem, and the problem gets rapidly fixed and things get back up and running.
[00:08:19]
And in that model, support can be really, really ad hoc. So you can shout over the virtual or actual cubicle wall, depending on how your startup is set up. There's very few people, that feedback loop is really, really short. You don't need process because everyone's super engaged in what they're doing.
[00:08:41]
And when you get reports in from customers, it's very likely the head of customer support is a single person. And they're probably here. Or you're all in one Slack channel. So it becomes very, very easy if something's wrong, someone shouts at you, you fix it.
[00:08:59]
And this kind of brings us to our first learning. So our first learning here is, when you start off and when you're small, you don't actually need very much to support your code in production. And the trap here is to try and introduce too much process and too much bureaucracy. If you introduce too much bureaucracy and process at this stage, you're seen as slow, you're seen as getting in the way, you start breaking those super short feedback loops. Because life in a startup does feel crazy and out of control at times. And that's actually almost deliberate. It's a bit of chaos for moving quickly, finding market fit. And this is the stage, particularly as an engineering leader, to start building the relationships, because in a startup it often happens that those who are there at the beginning go on to become senior leaders in the future organization. So that head of customer support who may only be one person may end up heading up a customer support team of 50, 60, 70 people in a few years. And so building those close relationships is super, super important.
[00:10:02]
Right, so let's say you've moved on in your gardening journey. So you've moved on from your windowsill, you've maybe got a terrace or a little patio, you've got some pots on it, you're growing some slightly bigger plants in them. Let's go for a slightly harder flower to guess. So, does anybody know what that is?
[00:10:28]
I like your thinking faces. This is brilliant. I'm going to put you out of your misery. It's a stock. Which apparently is the same in French. Which you can get in this lovely bouquet of flowers from Bloom and Wild. Can you see a theme here?
[00:10:42]
Now, so in our journey with Bloom and Wild, we've moved on a bit. So we've scaled up a little bit. We've launched a proposition in Germany and a proposition in France under the Bloom and Wild brand. We've launched mobile apps. We've grappled with the tricky problems of localization and different currencies that we didn't have to worry about when we were only selling in England.
[00:11:08]
The team's got bigger. Although as you can probably see from here, we've adopted a very non-scalable naming convention for our teams.
[00:11:17]
The third team was not called and.
[00:11:20]
But we make a big push for localization at this point, a big push for internationalization. We hire more engineers; this project is super, super challenging. And the quality of what was delivered was initially really not particularly good. And what that led to was a lot of difficult behaviors. So a lot of other parts of the business back-channeling engineers in Slack to try and get things fixed, because they had their favorite engineer.
[00:11:53]
We tried to solve that problem by making a single Slack channel for support. That got very chaotic, very quickly. Because everybody would come in with different things that they thought needed supporting, and it got very, very hard to prioritize. So this was the phase to really start to establish some trust between the engineering team and the rest of the business around production support.
[00:12:22]
So enter this little guy. This little guy is the guy we call Supporty. And Supporty is a little Slackbot, written in TypeScript. And we introduced Supporty to try and bring a bit of order to the chaos. Because the problem with having all of our support requests in a single channel was that there was no effective prioritization mechanism. It was often really unclear to the engineers what they should be working on and when they should be working on it. I couldn't see what the engineers were really working on.
[00:13:02]
So I didn't know whether they were working on the right things. It was really, really hard to get any sort of statistics to learn from about what people had been working on. Sometimes the wrong things got worked on. And sometimes only support things got worked on, which when you're a startup with very, very limited resources, is not a good thing. Because that's when you get the CEO tapping you on the shoulder and saying, hey Steve, that feature that was supposed to be shipping last week, that's not shipping. And then you spend three or four hours trying to work out what the problem was and you find somebody was fixing little grammar errors on the website instead of working on the feature and you had absolutely no idea. So it all becomes really, really chaotic. And so, this little guy is there to bring a bit of order to the chaos.
[00:13:58]
So Supporty's a little TypeScript bot, it took an engineer a couple of afternoons to write Supporty.
[00:14:04]
It's very, very simple how Supporty works. We put him in the support channel. Users call him up, they get a little form, and that little form asks them for some information. So we get some clear, consistent input. We have a set of priorities so that we can understand what should be worked on and when. And if you pick up a support issue, you know what you should be working on and when. We can triage those issues out to different teams. Engineers can reply. Everything's done in threads, and consequently that channel stays nice and clean. Which is super, super important. And then we've got other bits of functionality where we can hook it into Jira and so on. And then really importantly, we push a bunch of stats into our data warehouse from this as well.
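As a rough illustration of the flow just described (a form that gives consistent input, a priority, triage out to a team, and stats pushed to a data warehouse), here is a minimal TypeScript sketch. The talk doesn't show Supporty's actual code, so the names `SupportRequest`, `triage` and `toWarehouseRow`, the form fields, and the area-to-squad routing table are all assumptions made for illustration.

```typescript
// Hypothetical model of a Supporty-style request; field names are illustrative.
type Priority = "P1" | "P2" | "P3";

interface SupportRequest {
  id: number;
  raisedBy: string;
  summary: string;
  area: string; // part of the platform the user picked in the form
  priority: Priority;
  raisedAt: Date;
  resolvedAt?: Date; // unset while the issue is still open
}

// Assumed routing table from platform area to owning squad.
const squadByArea: Record<string, string> = {
  checkout: "squad-checkout",
  fulfilment: "squad-fulfilment",
  crm: "squad-crm",
};

// Route a request to a squad, falling back to a shared triage queue.
function triage(request: SupportRequest): string {
  return squadByArea[request.area] ?? "triage-queue";
}

// Flatten a request into a row suitable for loading into a data warehouse.
function toWarehouseRow(request: SupportRequest) {
  const minutesToResolve =
    request.resolvedAt === undefined
      ? null
      : (request.resolvedAt.getTime() - request.raisedAt.getTime()) / 60000;
  return {
    id: request.id,
    squad: triage(request),
    priority: request.priority,
    minutesToResolve,
  };
}
```

Keeping the warehouse row flat (one row per request, with the resolution time already computed) is one simple way to make the later statistics easy to query; a real bot would also need the Slack and Jira integration, which this sketch leaves out.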
[00:14:56]
So, here's an example of what our little channel looks like. As you can see, you've got different squads, you've got different issues. And you've got a really simple view of the current state of open issues.
[00:15:11]
And what's really important here is this is a completely open Slack channel. So anyone in the organization can join this Slack channel; I think at the moment it's got about 250 people in it out of a company of about 350. And that's really important for building trust, being open and transparent. And I think that's really, really important when you're talking about production support. It can be a bit of a trap to try and hide it away, thinking it's a bad thing.
[00:15:37]
Now, there are obviously off-the-shelf solutions that can do this quite effectively, but they're quite expensive; you pay by the user. This was a cheap and cheerful solution, which was right for the stage that we were at at the time, as we were scaling up.
[00:15:55]
Now, I've talked a little bit about trust; what's also really, really important here is having clear priorities. Because clear priorities build trust between you in the engineering team and the rest of the business. It seems really simple, but having this mechanism with clear examples really helps those users understand. And it enables them to stop worrying about whether somebody hasn't commented on something that needs a fix in 24 hours versus the thing that needs to be fixed straight away. And these examples are super important. So our P1s are, the sky is falling in. The website is down. Or we can't, for example, book our deliveries on a particular carrier because there might be, let's say, a bug in our booking mechanism. Whereas a P2 is something which is a little bit less important, a little bit less time sensitive. Sometimes because that time sensitivity hasn't occurred yet. So, as an example, we stop booking things on carriers for delivery tomorrow at around about 10 o'clock in the evening. So if it's 9 o'clock in the morning and you can't book things, it's not as important as if it's 9 o'clock in the evening and you can't book things.
[00:17:12]
And then P3s are kind of everything else, so things like grammatical errors. Bugs that it would be nice to fix, but we can put them in the next sprint, for example; they don't necessarily need to be fixed straight away. So this is super important as an alignment mechanism with the rest of the business, and that alignment builds trust.
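The time-sensitivity point (the same booking failure matters more at 9 in the evening than at 9 in the morning) can be sketched as a small rule. This is only an illustration: the 10 pm cutoff comes from the talk, but the three-hour escalation window, the function name `carrierBookingPriority`, and the choice to treat a failure after the cutoff as a P2 are assumptions.

```typescript
type Priority = "P1" | "P2" | "P3";

const BOOKING_CUTOFF_HOUR = 22; // from the talk: bookings stop around 10 pm
const ESCALATION_WINDOW_HOURS = 3; // assumption: how close to the cutoff counts as urgent

// Priority of a "cannot book deliveries on a carrier" issue, given the local hour (0-23).
function carrierBookingPriority(hourOfDay: number): Priority {
  const hoursUntilCutoff = BOOKING_CUTOFF_HOUR - hourOfDay;
  // Inside the window before the cutoff the issue is time critical.
  if (hoursUntilCutoff >= 0 && hoursUntilCutoff <= ESCALATION_WINDOW_HOURS) {
    return "P1";
  }
  // Earlier in the day, or once the cutoff has passed (assumed here),
  // the time sensitivity has not occurred yet.
  return "P2";
}
```

With these assumed values, `carrierBookingPriority(9)` gives a P2 and `carrierBookingPriority(21)` gives a P1, which matches the morning versus evening example above.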
[00:17:37]
So that brings us on to our second learning. So, as you scale, you will need to bring a bit of order to your chaos. And tools can help you bring a bit of order to that chaos. You can make your own, you can buy them. Jira Service Management, for example, if you're in the Atlassian ecosystem; I think it's about $15 a month per user or something. You could write them yourself; we certainly didn't spend that amount of money writing Supporty. Make sure you can get statistics and data from your tools, whatever you do. I'll talk about that in a minute.
[00:18:14]
There is a trap. This is the trap if you start using tools. So if you start using tools, it's very, very tempting to hide behind the tools. And start to cut yourself off from the rest of the business. So just be aware, if you're scaling up a support process and you start to put a tool in place, I would strongly recommend you keep it open, visible and transparent.
[00:18:37]
And you set some clear expectations on both sides. So these are actually our expectations for, I guess, users of Supporty. So, for example, we love it when people flag issues and tell us what's wrong. We love it when people remember there are humans behind Supporty. So there are real engineers and real people with real feelings. We love it when people go into detail, we love it when people respond promptly when we need more information.
[00:19:08]
We, let's say, don't love it quite so much when people just continue to Slack their favorite developer. Because then we have no idea what's going on. Our engineers have got far better now at pushing people back into the support channel when that happens.
[00:19:24]
We'd rather people found workarounds if there is a known workaround.
[00:19:30]
And people aren't going to get it perfect. The most important thing, particularly with an issue, or a potential issue, that somebody spots in production, is that they tell you about it. It's much better for us to say, actually that's not a problem, than for somebody to sit there and worry that there is a problem.
[00:19:51]
Okay, the next step in our gardening journey. So Bloom and Wild's got bigger. So we're maybe about 30 people. So we're in five countries, four languages, three currencies. Maybe in your gardening journey you've got something a little bit more. So a raised bed in the UK is a big flower bed with wooden sleepers on each side, usually used to grow vegetables in. I have one in my garden that I grow vegetables in.
[00:20:21]
So, let's go for another guess the flower. Does anybody know what flower that is?
[00:20:28]
Don't feel bad if you don't. When I joined Bloom and Wild, I knew nothing about flowers at all.
[00:20:36]
That is an Almia.
[00:20:40]
Which is apparently the same in French as in English.
[00:20:45]
It is available in this lovely bouquet from Bloom and Wild.
[00:20:50]
Okay. So, I talked a little bit about making sure that you have data, making sure that you are pushing data out of your support process into a data warehouse as soon as you start to put tooling in place. That's super, super important. So, here's an example of why. This is a P2 issue we had a few years back. We started getting reports that people in the US couldn't purchase flowers. Now, we don't ship flowers to the US, but the beauty of our online service is that if people live in the US and they have people they want to send flowers to in the UK, they can use our service. We'd changed some security configuration, and we started getting reports that people couldn't use our app from the US. Now, it turned out that configuration change had blocked the whole of the United States of America from using our app or our website. Which was not a good thing. Big lesson learned there. But then there's the cost of that P2 incident. So looking into Supporty, the thread for that P2 incident had 143 replies in it from 12 different people. And it took about five hours to resolve in all. So, this cost the business a fair bit of money. Because those 12 people, their salaries are not cheap. Because they're in the engineering team.
[00:22:31]
Now, time well spent. You look at this App Store review where somebody using our Android app had given us one star, and afterwards they came back and changed it to five stars because we fixed their problem.
[00:22:47]
That's great. But it's really important to understand the cost of it. And that's where statistics and data are your friend. So, as you get bigger, you are bound to be asked if you're in an engineering leadership position, why is your team slowing down? So you get that tap on the shoulder from the CEO. I've had it a couple of times. The team doesn't seem to be going as quickly these days, does it, Steve?
[00:23:14]
And it's really important to understand the impact support is having on a team and a team's ability to move quickly. So, being able to have the data, being able to see, for example, and set KPIs on things like how quickly you're resolving issues. This one's quite interesting, which is how many issues do we get raised that actually result in us making a code change? So, if you look at the one there, you'll see a large number of issues, a small number of code changes. Now, that could mean not that that's a really buggy part of our system, but that it's an incredibly complicated part of our system. And so often, the solution to excessive support load is not necessarily making sure your code is of higher quality, it's making sure that your apps are actually easier to use. And that's definitely the case for this area of our system. But we wouldn't be able to identify that without being able to visualize it in this way.
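The "issues raised versus issues that led to a code change" comparison can be computed directly once the data is in one place. The sketch below assumes a hypothetical per-area summary (`AreaStats`) and hypothetical thresholds; none of these names or numbers come from the talk.

```typescript
// Hypothetical per-area summary pulled from the support data.
interface AreaStats {
  area: string;
  issuesRaised: number;
  issuesWithCodeChange: number;
}

// Fraction of raised issues that ended in a code change (0 when nothing was raised).
function codeChangeRatio(stats: AreaStats): number {
  return stats.issuesRaised === 0
    ? 0
    : stats.issuesWithCodeChange / stats.issuesRaised;
}

// Areas with many issues but few code changes: possible usability problems
// rather than bugs. Both thresholds are assumptions to tune per organization.
function usabilityCandidates(
  all: AreaStats[],
  minIssues: number,
  maxRatio: number,
): string[] {
  return all
    .filter((s) => s.issuesRaised >= minIssues && codeChangeRatio(s) <= maxRatio)
    .map((s) => s.area);
}
```

A low ratio on a busy area is only a hint that the area is hard to use; it still needs a human to read the threads and confirm.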
[00:24:21]
And you can go further. You can, for example, as I've done over here, map on actual average engineer salary. So you can say, roughly, this is what support is actually costing us. Should we make some targeted investments in some areas here, either to make our system less buggy or to make our system easier to use? So in my opinion, support stats are just as important as any other visualization you have on other areas of your system. They're super, super important. So if you observe and measure systems, you should observe and measure support.
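To make the salary mapping concrete, here is a small sketch that puts a rough ceiling on the cost of the P2 incident described earlier (12 people, about five hours). The hourly figure is a placeholder and not a number from the talk, and the calculation assumes every participant was engaged for the whole incident, so it gives an upper bound rather than the true cost.

```typescript
interface IncidentEffort {
  participants: number;
  hoursToResolve: number;
}

// Upper bound on people cost: assumes every participant was engaged
// for the full duration of the incident.
function maxIncidentCost(
  effort: IncidentEffort,
  hourlyCostPerEngineer: number,
): number {
  return effort.participants * effort.hoursToResolve * hourlyCostPerEngineer;
}

// Figures from the talk: 12 people in the thread, about five hours to resolve.
const usBlockIncident: IncidentEffort = { participants: 12, hoursToResolve: 5 };

// Placeholder rate for illustration only; substitute your own average.
const assumedHourlyCost = 50;

const upperBound = maxIncidentCost(usBlockIncident, assumedHourlyCost);
```

At most 60 person-hours went into that one thread, so even a modest hourly figure turns a single P2 into a visible line item, before counting any lost orders.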
[00:25:05]
So make it visible.
[00:25:08]
Now, as an aside, if you have been tapped on the shoulder by your CEO and your CEO has said, why is everything moving so slowly? I do have a presentation for that, if you do need any advice. Anyway, back to this presentation. There's a couple of other metrics that I think in the context of support are really important to look at. The first one being change failure rate. So, change failure rate, I'm sure a lot of you know it, it's one of the DORA metrics. It's the number of production incidents divided by the number of changes. Or in other words, how often do you make a change in production that breaks production?
[00:25:54]
And it's very useful. A high change failure rate, for example, could indicate that one of your teams maybe doesn't have a good enough handle on testing and quality practices through their life cycle. It could be that they have an inefficient and error-prone deployment process. There are lots and lots of different reasons.
[00:26:14]
Now, there are lots of ways you can measure this. Simplistically, we're measuring ours by looking at the number of deploys and looking at the number of major incidents and tying the two together.
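The simplistic measurement described here, major incidents divided by deploys, fits in a few lines. The function name and the example numbers in the comment are made up for illustration.

```typescript
// Change failure rate as described in the talk: the number of major
// production incidents divided by the number of deploys in the same period.
function changeFailureRate(majorIncidents: number, deploys: number): number {
  if (deploys <= 0) {
    throw new RangeError("deploys must be positive");
  }
  return majorIncidents / deploys;
}

// Made-up example: 5 major incidents across 100 deploys in a period.
const exampleRate = changeFailureRate(5, 100);
```

This counts every major incident against deploys whether or not a deploy caused it, which is the trade-off of tying the two numbers together without attributing each incident to a specific change.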
[00:26:30]
The other one that I think's really important to look at is time to recovery. So a short recovery time enables you to move more quickly pre-deploy, because you can recover from a particular incident or an error much more quickly. Because, let's be honest, software is complicated.
[00:26:53]
We can move super slowly by testing everything or we can move more quickly by not seeking perfection. Sometimes things go wrong and that's important.
[00:27:04]
It's super easy to measure.
[00:27:07]
And it's important. So investing in good monitoring, good alerting, good observability enables you to minimize your time to recovery. Making sure that you have runbooks for key scenarios minimizes your time to recovery. If something goes wrong, you want to cut the cognitive overhead of getting to the root of the problem to as short a period as possible. So runbooks are really important.
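Since time to recovery is described above as easy to measure, here is a minimal sketch of one way to do it: average the gap between when each incident started and when service was restored. The `Incident` shape is an assumption; in practice the timestamps would come from whatever alerting or support tooling is in place.

```typescript
// Hypothetical incident record with start and recovery timestamps.
interface Incident {
  startedAt: Date;
  recoveredAt: Date;
}

// Mean time to recovery in minutes, or null when there are no incidents.
function meanTimeToRecoveryMinutes(incidents: Incident[]): number | null {
  if (incidents.length === 0) {
    return null;
  }
  const totalMs = incidents.reduce(
    (sum, incident) =>
      sum + (incident.recoveredAt.getTime() - incident.startedAt.getTime()),
    0,
  );
  return totalMs / incidents.length / 60000;
}
```

A mean hides outliers, so it can be worth tracking the longest recovery in a period alongside it.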
[00:27:33]
Having deployment pipelines that are as fast as possible is really important, because there's no point getting quickly to the root cause of the issue and then finding it takes 45 minutes to deploy the fix. So investing in fast deployment pipelines is not only good day-to-day but it's super important for supporting your systems and recovering from failure. Practicing key failure scenarios can also really, really help. Because it means that your engineers get some muscle memory, effectively, for when things go wrong, so when there's a bit of stress and chaos, when there is a major production incident, they know what to do. And then finally, holding post-mortems and making sure that you're learning from any of your experiences and applying those learnings to make things better next time. So, that's learning number three. Collect your data so you can learn from it, and make sure you are recovering quickly. And discover the cost of the incidents as well. What does it cost your business when there's a failure? Do you even know what it costs?
[00:28:39]
Okay, next stage in our gardening journey. I think the previous guess the flower was a bit hard, so I'll give you an easy one now, right?
[00:28:48]
What's that?
[00:28:51]
It's a sunflower. Guess what? It's available in this beautiful bouquet from Bloom and Wild.
[00:28:59]
So, around this time, everything was fine in hours. "In hours" at Bloom and Wild means from 9:00 a.m. until 6:00 p.m. So we had a mature process, we had tooling, we had data. And people were following the process.
[00:29:18]
But a little bit like Cinderella, but a bit earlier, at 6:00, when the sun kind of went down or the clock struck six, this all went away. And so after 6:00, what would happen if there was a major incident, is it would usually make its way to the CEO or the COO. I would get a phone call. And then I would try and find an engineer who could help me fix the problem. And this isn't a good thing.
[00:29:50]
This used to make me feel like this.
[00:29:55]
Because when you don't know when your phone is going to ring, and you don't know whether you can solve the problem, it's a really, really horrible situation to be in. It's a bad situation for one's mental health, frankly, to be in that sort of situation. Now, what this meant was, this was the time to use all the data about the cost of support and about the cost of incidents to the business to build a business case for paid out of hours support.
[00:30:26]
Now, this is something I wish I'd done much earlier in the journey at Bloom and Wild. It was an absolute game changer, not only for my health and well-being, but also that of my engineers.
[00:30:40]
So, I mean, I just got that off the internet. I have no idea what that man is doing or any of the context at all. But out of hours support, you can think of it a little bit like an accident and emergency department at a hospital. So the role of your out of hours support is effectively to stop the patient from dying, and then hand them over to people that can help them get better.
[00:31:07]
So in the context of software, that means your out of hours support is there to mitigate the issue in the middle of the night if it occurs. They're not necessarily there to fix the issue, but in hours they can then hand it over.
[00:31:23]
Now, I think all companies should pay for their out of hours support. The reason I think that is because it's fair, frankly. Not knowing whether your phone's going to ring in the middle of the night isn't fair. Thinking your phone might ring in the middle of the night because you are the engineer who most often answers your phone in the middle of the night isn't fair. Me having my phone ring in the middle of the night because I'm in charge of the engineering department, I would argue, is also not fair. So, paid out of hours support is fair. It helps set really clear expectations. It reduces the time to recovery massively, because you can rely on the fact that somebody will answer the phone and they know what to do, irrespective of the time of day. So it's good for the company. It's good for the engineers. In the grand scheme of things, when you think about the cost of a production incident, it's not actually that expensive to the company either. And it's not hard to craft a good business case for out of hours support. And I think all companies should pay for their out of hours support, and they should do it early on in their company journey. I wish I'd done it sooner.
[00:32:35]
Now, like any process, when you introduce something new, sometimes it doesn't go perfectly. So, this is Bernat. Bernat is asleep. Bernat is a real person, by the way; that is actually Bernat. He's asleep when we receive an alert in Opsgenie, which is the tool we use for out of hours support. Now, what Opsgenie does is phone up Bernat. It will keep phoning Bernat up at five minute intervals for 45 minutes until Bernat answers the phone. If he doesn't answer the phone, it will phone me up. I've got great engineers; my phone has never rung. So, the phone rings. It wakes Bernat up at about 2:00 in the morning. Bernat gets up. Opens his laptop.
[00:33:24]
Logs on, sees what the issue is.
[00:33:29]
I've ringed the issue in red.
[00:33:34]
Bernat closes the issue, Bernat goes back to sleep.
[00:33:38]
Now, I don't know whether you're anything like me and you're wondering what it was, I'm wondering whether it was Lionel Richie.
[00:33:48]
I don't think he's in our customer support team.
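The escalation behavior in that story (call the on-call engineer every five minutes for 45 minutes, then call the engineering lead) can be modelled as a simple schedule. This is a plain simulation, not the Opsgenie API or its configuration format; the type names, the callee labels, and the choice that the escalation call happens exactly at the 45 minute mark are assumptions.

```typescript
// Hypothetical description of an escalation policy.
interface EscalationPolicy {
  retryIntervalMinutes: number;
  escalateAfterMinutes: number;
  primary: string;
  secondary: string;
}

interface Call {
  minute: number; // minutes since the alert fired
  callee: string;
}

// List the calls that would be placed if nobody ever acknowledges the alert.
function unacknowledgedCalls(policy: EscalationPolicy): Call[] {
  if (policy.retryIntervalMinutes <= 0) {
    throw new RangeError("retryIntervalMinutes must be positive");
  }
  const calls: Call[] = [];
  for (
    let minute = 0;
    minute < policy.escalateAfterMinutes;
    minute += policy.retryIntervalMinutes
  ) {
    calls.push({ minute, callee: policy.primary });
  }
  // Assumed: the secondary is called once the escalation time is reached.
  calls.push({ minute: policy.escalateAfterMinutes, callee: policy.secondary });
  return calls;
}

// Values from the talk: five minute intervals for 45 minutes.
const outOfHoursPolicy: EscalationPolicy = {
  retryIntervalMinutes: 5,
  escalateAfterMinutes: 45,
  primary: "on-call-engineer",
  secondary: "engineering-lead",
};
```

Writing the policy down like this makes the worst case explicit: under these assumptions an unanswered alert means nine calls to the on-call engineer before anyone else is woken up.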
[00:33:52]
Okay, so the fourth learning here is make it easier, make it fairer. Do it earlier than you think: pay for your out of hours support, for the good of you and your engineers, please.
[00:34:05]
Okay, so we've moved on. You've really got into your gardening, your floristry game; you've got a whole flower garden.
[00:34:15]
What flower's that?
[00:34:21]
That is a peony, that's a coral charm peony. Available for about seven weeks of the year in this beautiful bouquet from Bloom and Wild. So, May is peony season, if you like peonies.
[00:34:33]
Only in May. So, the astute amongst you might have realized, I've talked a lot about production support, but I haven't talked about the angle of support which is how you actually do the maintenance, and the gardening as I would call it, of your systems. So,
[00:34:53]
things like keeping frameworks, languages up to date, fixing the little annoying things like slow queries, flaky tests, upgrades. As I mentioned at the beginning, we started with a Ruby on Rails monolith. We still have quite a lot of that Ruby on Rails monolith today.
[00:35:09]
So, around about the time we start splitting and we go from one team to multiple squads, we start to realize we have a problem. So, because we kind of expect everyone to care about everything, that means sometimes nobody's caring about the right things. Sometimes too few people are caring about the right things.
[00:35:32]
Sometimes too many people really care about the wrong things and then get really frustrated when those things don't get prioritized. So, we introduce a role that we call the gardener. We have to call it the gardener.
[00:35:45]
So, the gardener's responsibilities are these. They're there to monitor and fix in-hours production issues, and fix the staging environments. Merge, test and bump dependencies, for security fixes for example. Help us upgrade things like Ruby and Rails. Where we've got shared tests and they're a bit flaky, fix those. And then generally help squads with our staging and our production environments. And at this point we were of the size where this was a single person.
[00:36:21]
And there were some real advantages to this approach. It enabled engineers to understand areas outside of their core code base. It helped to maintain shared ownership for things that had to be shared by the nature of having a monolith. And for our engineering managers, it made managing capacity much easier, because they could just take one engineer or two engineers out of the pool and say they are doing this for one week or two weeks, rather than having them work on it alongside their other work. So it really helps manage our monolith and our shared monolith problem, and also share that responsibility amongst our community of Ruby engineers.
[00:37:09]
So, that brings us to our fifth learning. Supporting your systems is not the same as supporting your users, particularly if you need to support monoliths and shared problems, and assign single ownership.
[00:37:26]
Okay, so we're almost through our gardening journey. There's only one more flower to guess. And this one is very tough.
[00:37:35]
Now, I've called this section the allotment. An allotment in the UK is basically a vegetable garden, a shared vegetable garden. And uh, I have an allotment, it took me nine years to get my allotment. I got it by spending six years maintaining the website for the Allotment Association. During five of those six years, the Allotment Association didn't even own any allotments. Which made my job quite easy. Uh, what flower's that?
[00:38:06]
Nah.
[00:38:10]
That is Craspedia globosa. Do you want to know why that's my favorite flower? It's my favorite flower because it reminds me of the microphone from a 1970s game show.
[00:38:27]
Okay, so this is roughly where Bloom and Wild is today. So, 65 people, multiple stream-aligned squads, with a platform team operating in an as-a-service model.
[00:38:44]
And we're supporting a business of about 350 people at the moment.
[00:38:54]
And that means we have some scaling challenges. So, just like anything when you're gardening, when you're looking after something, you turn your back and your lovely garden ends up looking like this.
[00:39:08]
So, as the team scales, we get new challenges. As the team scales, we build more solutions, we build more functionality, we build more complexity, and it becomes increasingly hard for one or two people to hold that entire complexity in their own heads. It becomes impossible.
[00:39:27]
We hire lots of new engineers. Some of our older engineers leave. And despite our best efforts, the knowledge does not always get passed on, and particularly the tacit knowledge.
[00:39:40]
And we find the gardener doesn't really work quite so much anymore. Because it's weekly, it focuses on short-term fixes. It doesn't anchor responsibility in our squads. And we often find our engineering managers and our product managers start to sort of trade off the gardener's time when high priority work comes in in their squads because they've got a bit of available capacity. And they understand much more about the feature they need to deliver than whether it's really necessary to bump those dependencies this week.
[00:40:12]
And so we make some changes. Quite simple changes, think of them a little bit like a shared vegetable garden, like this one here. So think of them like an allotment.
[00:40:24]
So we moved the role, and the role becomes a responsibility of the squads. So this anchors accountability and ownership in the squads. It means that engineers can focus on their craft and enhance their capability, and it uses the experts to care for the domain. Our platform team gives them the support they need, whether that's observability, whether that's monitoring, whether that's dashboards, whether that's focusing on how they can improve their operational health. We give them a set of clear KPIs and we tie them together with a head gardener, which is one of our principal engineers, to make sure that there is some consistency in terms of what they're working on, because a lot of them are still working in shared code bases.
[00:41:09]
And we make sure that we're making that investment clear. So we get our squads to tag up the work that they're doing, whether it contributes to a strategic goal, whether it's a tactical piece of work, or whether it is operational, i.e. the overhead you have for operating your software in production, irrespective of whether you're actually changing anything or building anything new. This sort of view also really helps me when I'm discussing tradeoff and investment decisions with the exec. Because I can say there is a base load in these teams, and we could reduce that base load if we make targeted investments, particularly, for example, to drive down areas of technical debt or complexity.
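As a rough illustration of the tagging idea, the share of a squad's effort that is operational "base load" could be computed from tagged work items along these lines. The three category names come from the talk. The function name, the use of effort in days, and the sample numbers are assumptions made up for the sketch, not Bloom and Wild's actual tooling or data.

```python
from typing import Dict, Iterable, Tuple

# The three tags mentioned in the talk.
CATEGORIES = ("strategic", "tactical", "operational")


def investment_split(items: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Return the fraction of total effort spent in each category.

    `items` is a sequence of (category, effort) pairs, where effort is in any
    consistent unit (days here, which is an assumption of this sketch).
    """
    totals = {category: 0.0 for category in CATEGORIES}
    for category, effort in items:
        if category not in totals:
            raise ValueError(f"unknown category: {category!r}")
        totals[category] += effort

    grand_total = sum(totals.values())
    if grand_total == 0:
        return totals  # nothing tagged yet, so every share is zero
    return {category: total / grand_total for category, total in totals.items()}


# Hypothetical sprint for one squad: (tag, effort in days).
sprint = [
    ("strategic", 20.0),
    ("tactical", 10.0),
    ("operational", 6.0),
    ("operational", 4.0),
]
split = investment_split(sprint)
print(f"operational base load: {split['operational']:.0%}")
```

Tracking that operational share over time is one way to show whether a targeted investment in technical debt actually reduced the base load.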
[00:41:53]
Now,
[00:41:55]
we could have just made a team of gardeners and pulled the gardeners all together in a team and then rotated them around. We didn't do that because, well, that's a horrible team to be in, being that "fixing all the broken stuff" squad. Uh, and it's also really hard to justify, because it's harder to show the direct value a squad like that is actually generating for the business. It doesn't anchor the ownership in the right place.
[00:42:25]
So, our last learning: make autonomy and make ownership your goal. Don't build a team of gardeners.
[00:42:34]
So it's all good now, right? The world looks like this in Bloom and Wild. Of course not. It's always a work in progress, right? So it's not perfect. Uh, our monoliths still hold us back. We're working on better ownership within our monoliths and componentization within our monoliths to enable that ownership model. We're working on better processes and tools and training so that squads can take ownership. And arguably we might start to have outgrown supporting our lovely little Slack bot as well. Oh, in case you were wondering, that is my garden. I think I need to mow the lawn.
[00:43:14]
Okay, so,
[00:43:17]
if you build something, you need to support it. So make sure as you do that, you're setting clear expectations, and make sure when you introduce process and you introduce tools, you do that at the right time.
[00:43:32]
As early as possible, collect data, make sure that you can learn from it. Please, please pay for your on call.
[00:43:42]
And your ultimate goal here should be ownership and autonomy of squads.
[00:43:51]
Your codebase is your garden. So if you keep it nice and tidy, other people will notice too. Thank you.
[00:44:06]
And we now have time for questions. I will take questions about flowers if you like, but I'd rather they were about production support.
[00:44:14]
Uh, thanks, Steve. Uh, the gardener model that you've got at the moment, where each squad has a gardener, do you mandate how they allocate from their team to be a gardener? So do some squads have one person that's always the gardener, or do they all rotate on a regular basis?
[00:44:32]
Uh, we don't mandate what they do, but all squads rotate, usually on a bi-weekly basis. So we run a bi-weekly cadence across all of our squads, and so I think they rotate bi-weekly mainly because it makes the capacity management for the engineering manager easier.
[00:44:49]
I think it's important they do rotate, by the way, because it shares that learning, it shares that opportunity, and it also avoids somebody becoming a single point of failure or always being forced to do the support work.
[00:45:13]
Yeah, thanks for the presentation. I have a question about the priority, and the level of priority that people are setting when declaring a support ticket. Is it something that you personally review, or do you trust the reporter with the priority, or does everything end up in priority one? How does it work?
[00:45:32]
Yeah, so what happens is the person who raises the request proposes a priority.
[00:45:41]
And then we may choose to adjust that priority together with them. Uh, because yeah, you do sometimes have exactly the behavior that you've explained, which is that something that's very, very important to me, I might think is a priority one. But that's why it's really important to have a set of examples about what is a priority one and what isn't a priority one, and to make those really visible. Because then where somebody does maybe try and push it, or make a bit of a mistake and just get the wrong one, you can just say, this is how we prioritize things and this is why we say it's a P2, for example.
[00:46:27]
Question from the back.
[00:46:28]
When you have a low bug priority, do you purge it or do you keep uh logs of uh very old bugs that will never be fixed?
[00:46:39]
So I guess with P3 requests, different things can happen. A P3 request can be a genuine bug, and then that bug would go into a backlog. Uh, we try every maybe three or four months to purge the backlogs because I think that when bugs get too old, they have very, very little value. And there's actually more value in somebody going through the effort of writing the bug again than in trying to understand the bug from something somebody wrote three, four, five months ago. Uh, I mean, the thing with P3 support requests is they're often used as a bit of a kind of random dumping ground for, I don't really like this thing. So, sometimes they don't even turn into bugs. Sometimes they're suggestions for features and somebody's just used slightly the wrong process. So again, for me, as long as every support request is connected to an action that happens in a squad, so the person who raises it knows where it's gone and they know who they can go and follow up with, then I'm happy. And so yeah, if an engineering manager or a product manager can say to somebody, okay, actually, I don't think your bug is particularly important and therefore we've closed it and we're not going to fix it, as long as they've had that feedback loop, I'm happy.
[00:48:13]
Do you have the same engagement from all your engineers, and how do you educate around that topic?
[00:48:24]
I think like anything there are differing levels of engagement. I have some engineers who actually really like being on support, because there's a very short feedback loop on support. And so there's a very quick kind of dopamine hit of, there was a problem, I fixed the problem, great. Here's another problem, I fixed the problem. So being on support sometimes can kind of make you feel quite good. So some of our engineers really like being on support, some of our engineers don't like being on support.
[00:48:56]
Uh, we sell it as a learning opportunity and also an obligation, frankly, as much as anything else. So if you write software, you have an obligation to operate it, monitor it, and support it. And that's part of our engineering principles, that's a non-negotiable.
[00:49:14]
But I think in a lot of cases, because it's usually a short-term two-week rotation as well, even if people really hate it, they go on it, it's something they just need to do, and then they kind of roll off it. But yeah, I'd say engagement varies across the team, but it's a really important thing that everybody does.
[00:49:38]
And so, can you expand a little bit on the out of hours support? So I understand that you want to minimize the calls to the engineers at midnight. Uh, so how do you find the right balance of onboarding and educating that service provider, whoever it is? And how much do you invest in them, you know, building the skills needed to not have to escalate, which I imagine can happen at some point?
[00:50:02]
Yeah, so I guess there's two things. So, that out of hours group is a small group, it's five of our most senior engineers, so I guess they're the most engaged in what we do.
[00:50:15]
I run that group as a very autonomous group, so the group makes the decision for example on what alerts will trigger them and wake them up in the middle of the night. I don't make those decisions, and I think it's really important to have that level of ownership and that level of engagement in the group.
[00:50:33]
That way, that group is really bought into the fact that, A, they don't want to be woken up in the middle of the night if they don't have to, but B, they are being compensated in case it happens. So we pay irrespective of whether their phone rings, we pay them a fixed rate every week. It doesn't matter whether the phone rings or not, because we're paying them for the inconvenience, the potential inconvenience, so we're paying an insurance policy, basically. Uh, we give that group training, that group tells us what runbooks we need, and then we get those runbooks from the relevant squad that owns that particular area. When we build new features, upskilling the out-of-hours support team is part of the definition of done for a large feature. And so by doing it that way, that group is well trained, well educated and engaged in making sure that they don't get woken up very often. And that keeps that kind of engagement level and the happiness in that team high. Because I think if you don't do that, then it does just become an obligation. What you want is for it to be a kind of higher-level instance of, I'm operating and monitoring this in production and supporting it because it's my job. It just happens to be that we don't want to put all of our engineers on out of hours support, we do want to have a minimum bar for entering that team. For two reasons: number one, because we want the people on that team at 2:00 in the morning to be people who don't need to escalate, ideally. But also, secondly, and this is a really interesting kind of dynamic in a team like that, there are members of that team for whom being in that team is important, but actually getting paid for being in that team is also important. The more people you put on that rota, the less people make over the course of a year because the less they're rota'd onto the support. So there's kind of a balance there.
So we also make it the decision of the team how big the team is as well. So it happens to be five at the moment. If I wanted to add a sixth, I wouldn't go to that team and say I'm adding a sixth person, I would go to that team and say, should we add a sixth person? So it's really about making that team super engaged in that way and bought into the problems that they're trying to support.
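The rota-size trade-off can be made concrete with a little arithmetic. The sketch below assumes one person is on call each week, the weeks are shared evenly, and everyone is paid the same fixed weekly rate. The rate of 100 is a made-up placeholder, since the talk does not give a figure.

```python
def annual_on_call_pay(weekly_rate: float, rota_size: int, weeks_per_year: int = 52) -> float:
    """Expected yearly on-call pay per person.

    Assumes exactly one person is on call per week and that weeks are shared
    evenly across the rota (both are simplifying assumptions of this sketch).
    """
    if rota_size < 1:
        raise ValueError("rota_size must be at least 1")
    weeks_on_call = weeks_per_year / rota_size
    return weekly_rate * weeks_on_call


HYPOTHETICAL_WEEKLY_RATE = 100.0  # placeholder figure, not from the talk

for size in (5, 6):
    pay = annual_on_call_pay(HYPOTHETICAL_WEEKLY_RATE, size)
    print(f"rota of {size}: {pay:.2f} per person per year")
```

Under these assumptions, moving from five people to six cuts each person's yearly on-call pay by a sixth, which is why the size of the rota is left for the group itself to decide.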
[00:52:58]
Ah, yes. Does it work? Oh, yeah. Okay, there we go. So thank you for the inspiring talk. I have a quick question because when I was looking at the slides earlier, I think I didn't see any quality engineers in the teams. So I was wondering how Bloom and Wild works to make sure that the code or the deliverables check the quality box before they are pushed to production.
[00:53:24]
Ah, yeah, we do actually have some test engineers in some squads, I think it just wasn't shown on the slide. So some of our squads have test engineers, some of our squads don't. It depends primarily on how much customer-facing tech estate they've got. So, for example, our teams that own slices of our customer journey that are external customer-facing, that own parts of our web app or own our mobile apps, they have test engineers in, because that feedback loop to that customer is otherwise extremely long, and also the blast radius for a problem is extremely wide. Uh, whereas for our teams that own our production and our fulfillment software, so that's the software that's running in our production centers that either deals with the inbound for the flowers or deals with producing the flowers on the lines and fulfilling them, there's only a very few people, internal users of that system, and they're very, very closely aligned with that team anyway. So that team doesn't have test engineers in it, they test their own work and they've got a high level of test automation maturity as well. But they've also got the internal users sat very, very close to them anyway, so they play a part in the kind of broader QA process.
[00:55:02]
Okay, well, thank you so much for coming along. I hope you learned something about flowers as much as support.