Woody Rousseau and Flavian Hautbois
Duration: 49 min
Views: 678
9 likes
Published: November 15, 2022

Transcript (Translated)

[00:00:05] Hello everyone. We're going to give the talk that draws a parallel between tech and industry. And it's quite amusing, because when we read the book we're going to talk about, the person from Toyota described what they did as "extreme programming for industry". So for once, the influence went a bit the other way around. We're here to talk to you about quality, and today about radical quality. And so to begin,
[00:00:30] We all hate bugs. Bugs mess up our lives, from the applications we use to the banking advisor who tries to do something for us and can't because their system doesn't work. And we produce a lot of bugs in tech. We always say, "Bugs are just part of life"; there are phrases that come out like that, "It's not a bug, it's a feature." And so we're going to try to deconstruct that a little bit.
[00:00:55] And what's quite astonishing is that bugs are considered normal in our industry, and yet the consequences can be absolutely dramatic. I think you've heard about the bug that caused problems with the Ariane rocket. I found a slightly more exotic example: a machine called Therac-25, which was used to treat cancers in the 80s by sending electrons or photons. This machine killed five or six people because of a global variable in the code involved in a race condition; the details are quite complicated, but in any case it delivered 100 times the dose of radiation that was supposed to be sent to treat the patient, and so patients died. So we see the paradox: we find bugs normal, which is much less the case in industry, and yet the consequences are serious.

And in fact, when we ask ourselves why, I think we have a misconception, a false idea, which is that non-quality costs less than quality. There's a good book I read, written by people from IBM, called "The Economics of Software Quality", which evaluates the economic impact of quality versus non-quality. What you see on the x-axis are function points, the method IBM uses to measure the size of applications: the more function points, the more functionality, and the bigger the application. They were able to compare projects with high-quality practices (extreme programming, testing, etc.) against projects that were rather light on these aspects. We see that the impact is very large for very large applications; it amounts to several hundreds of millions of euros. They have analyzed thousands of IT projects since the 80s to get this data. And beyond that, even in percentage terms, the bigger the application, the higher this difference in the cost of producing it: we end up with a 25% difference between high-quality practices and low-quality practices. So the impact is very strong. And I find that this book has the merit of showing that this false idea really is false: doing it right the first time costs less than doing it carelessly.
[00:03:10] Yes, absolutely. And the Accelerate book goes even further: the idea is that they try to find clusters of companies that have good practices, and they manage to establish a correlation between companies that are very strong on
[00:03:25] these four metrics — deployment frequency, lead time, the percentage of deployments that introduce an incident, and the time to resolve them — and business performance. Thanks.
[00:03:44] And non-commercial performance too, that is, people's satisfaction, etc. And this first cluster shows that the four DORA metrics are in fact correlated with each other: the more often we deploy, the shorter the lead time, the fewer incidents we have, and the faster they are resolved. So quality and speed really go hand in hand.
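To make these four metrics concrete, here is a minimal sketch of how a team could compute them from its own deployment log. The record format, the field names and the choice of mean rather than median are ours, not Accelerate's:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional

@dataclass
class Deployment:
    deployed_at: datetime
    commit_created_at: datetime          # when the change was first committed
    caused_incident: bool                # did this deployment introduce an incident?
    incident_resolved_at: Optional[datetime] = None

def dora_metrics(deployments: list, period_days: int) -> dict:
    """Compute the four DORA metrics over a list of deployments (illustrative only)."""
    lead_times = [d.deployed_at - d.commit_created_at for d in deployments]
    failures = [d for d in deployments if d.caused_incident]
    restore_times = [d.incident_resolved_at - d.deployed_at
                     for d in failures if d.incident_resolved_at]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "lead_time_hours": mean(lt.total_seconds() / 3600 for lt in lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": (
            mean(rt.total_seconds() / 3600 for rt in restore_times) if restore_times else 0.0
        ),
    }
```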
[00:04:04] So from this observation, we decided to adopt a strategy, which is the zero-defect, zero-bug strategy. And we didn't adopt it because we consider it an end in itself; I think tech still has a long way to go to really get there. But on the other hand,
[00:04:21] I believe in it.
[00:04:23] Me too, but over a long time. In any case, we think it's the best strategy to train tech teams that are unbeatable, and therefore better than those of competitors. If we manage to ensure that each developer puts in that intensity every day to write code that never breaks, we think it's a way to really set ourselves apart from the competition. And that's a really strong concept in Lean; we'll talk about it in a bit. So, to introduce ourselves quickly, we both share this fight for zero bugs. My name is Woody, I'm the CTO and co-founder of Sipios. We are part of a group of service companies, the Theodo group. Sipios is specialized in financial services, so we typically build neobanks and fintech apps. And I started practicing what I'm going to tell you about a little over a year ago.
[00:05:10] The idea is to share that experience with you.
[00:05:14] I'm Flavien. I also worked for Theodo at one point; we actually worked together on a project a long time ago. After that I followed a separate path: I became the CTO of a company called Apricity that worked on fertility treatments, so the quality stakes were huge, because behind it there were medical treatments for people going through something that is also psychologically quite difficult. And I started my own business recently: I'm interim CTO for a company called Ocus, and I'm co-writing a book on digital product development, especially from a technological angle — I'm still a geek, I code a lot.
[00:05:56] And so Lean, for me, is really a response to the question of how we build excellent teams to make excellent products.
[00:06:10] Sorry, it's still me. So, what exactly are we going to talk about? We're going to start with the Toyota angle in industry, with Sadao Nomura. Then we'll make the bridge between that and our practices in tech, from a more general point of view, and then get to some specific examples of what Woody and I have tested separately in our contexts. And finally, at the end, we'll leave you with a rather lightweight framework that you can take home and test.
[00:06:39] And so we're going to start with industry, with the story of Sadao Nomura. That's his face we photoshopped, so our apologies to him. Sadao Nomura is, I consider him, Mr. Quality of Toyota: he is someone who was sent by Toyota to the factories that produced too much non-quality for Toyota to accept that they export vehicles. He was really the guy to call when things weren't going well. He notably helped factories in Australia and South Africa — notably in South Africa where there was a very difficult social context, just after apartheid, and he managed in that context to significantly improve quality until Toyota agreed to export. And he wrote a book, this book, which is the basis of everything we've done for more than a year, called "The Toyota Way of Dantotsu". Dantotsu means "better than the best" — the idea of doing better than the best. That's really Sadao Nomura's philosophy. He recounts what he experienced in several countries, essentially in North America and Europe, also drawing on learnings he had had in Australia and South Africa.
[00:07:42] He did that across 11 companies, by the way. And his objective was quite clear and lucid: to halve, year after year, the number of defects on the forklifts coming out, so as to reach roughly a 90% reduction in three years.
[00:08:02] You have to imagine what that means: if we have 100 bugs in year one, three years later we only have about 12 left, and then it becomes really tiny. And he managed to run these programs several times in a row.
[00:08:14] That is, in some factories he achieved -88% three times in a row. So it produced gigantic gains for the factories where he went to work.
[00:08:26] Yes, exactly. And one of his guiding principles was to start by classifying defects. We tend to classify bugs by importance, severity, etc. He chose a fairly simple system, centered on the team. Type A — it starts with that — designates the least serious case: the defect is caught inside the team, so it was never passed on to other teams. Type B is the defect that a team making a part passes on to a downstream team. Type C is the one suffered by a service provider: for example, I send the forklift to a company that will then deliver it, and if that company notices a defect on the forklift, it becomes a type C. Type D is the one the final customer sees — one of our vehicles has an accident; it's the worst that can happen, because in the end the customer has to return the forklift, he's not happy, etc. He sees the defect. And the idea behind this is that every type D defect was, earlier in the flow, a defect that could have been caught as type C, B or A.
[00:09:33] Defect is broader than bug, by the way — we'll come back to that. So that's his system to draw attention to quality.
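A minimal way to picture this classification in code — the escalation ordering is from the talk, the naming is ours:

```python
from enum import IntEnum

class DefectType(IntEnum):
    """Sadao Nomura's A/B/C/D classification: the further downstream a defect
    is detected, the more serious (and costly) it is."""
    A = 1  # caught inside the team that produced the part
    B = 2  # passed on to, and caught by, a downstream team
    C = 3  # caught by an external service provider (e.g. the delivery company)
    D = 4  # reached the final customer: the worst case

def more_serious(a: DefectType, b: DefectType) -> DefectType:
    # IntEnum ordering encodes the escalation: D > C > B > A
    return max(a, b)

assert more_serious(DefectType.B, DefectType.D) is DefectType.D
```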
[00:09:42] So obviously, the further we go in the flow, the more expensive the defect is: it costs more to repair, there are lawsuits, there are complaints that Toyota sometimes has to pay for when it delivers defects to customers. So there is really the idea of
[00:09:58] categorizing them to see the impact on the value that Toyota fails to bring to its customers with A, B, C, D defects. And so how do they do it? There's an idea that is central to the book, which is the 8-step procedure. It's a procedure that is in the hands of the team leader. For those who don't know how factories are organized: there are team leaders who each take care of 6 to 10 operators, the people who manufacture the parts. The team leader is responsible for these eight steps. There's a first step where he identifies the problem on the part, typically a part that's dented. He then goes to look in the stock of his factory area to see if there aren't other parts with the same problem, so he can correct them too and possibly salvage them — sometimes they're salvageable. He investigates the root cause that introduced the defect: is there, for example, a damaged screwdriver somewhere that causes this defect to appear? He implements countermeasures, which in that example could be replacing the screwdriver. And then he reports the resolved defect at the daily meeting, to explain what was learned from resolving it: what the cause was and what was put in place to correct it. But it doesn't stop there.
[00:11:10] The team leader, with the help of the QA department, then deploys the learnings horizontally throughout the factory, potentially in other processes where similar problems can occur. He trains the operators so that they really master the gesture that potentially caused the defect. And finally there are checks, what they call Go & See: they go to the field to see whether, in practice, the way the gesture is performed today is correct. What is quite striking when you see this is the determination to completely extinguish the cause of the defect. It really makes me think of Attila: the grass doesn't grow back. There's a will to say: this defect, never again in my factory. The second idea, which is also very strong, is that all of this must be done within 24 hours for each defect. You have to imagine: every time there is a defect on a part, the team leader goes through each of these steps to the end, and he has 24 hours to do it. And Sadao Nomura says a phrase that struck me: "Speed is key".
[00:12:06] Speed is key for quality. So here we see when the steps are done: all the steps up to the daily meeting are done on the day the defect is detected, and all the others are done the next day. They are not planned out. He doesn't say, "We're planning a training in two weeks"; it's really the next day that it happens, with all the operators.
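To make the 24-hour constraint concrete, here is a small sketch that lists the eight steps as paraphrased above and attaches the deadline to a detected defect; the data structure is purely illustrative:

```python
from datetime import datetime, timedelta

EIGHT_STEPS = [
    "1. Identify the defect on the part",
    "2. Check the stock for other parts with the same defect",
    "3. Investigate the root cause",
    "4. Implement countermeasures (e.g. replace the damaged screwdriver)",
    "5. Report the resolved defect at the daily meeting",
    "6. Deploy the learnings horizontally across the factory",
    "7. Train the operators on the corrected gesture",
    "8. Go & See: check on the floor that the gesture is now done correctly",
]

def deadline(detected_at: datetime) -> datetime:
    """Every step, for every defect, must be completed within 24 hours."""
    return detected_at + timedelta(hours=24)

if __name__ == "__main__":
    detected = datetime(2022, 11, 15, 9, 30)
    print("Complete all 8 steps before:", deadline(detected))
    print("\n".join(EIGHT_STEPS))
```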
[00:12:29] So there are several other aspects in, in... yes, go ahead.
[00:12:33] You have to clip it on, otherwise it won't work.
[00:12:34] Ah yes, okay. There are several other aspects in the book. The first is that, to achieve zero defects, they insist on visual management: actually seeing the deliveries, seeing the problems. Not only that, but... ah, that's not working. Oh, sorry.
[00:12:51] The second is standardization: describing what you need to know to build the product well.
[00:13:04] The third is the training programs, the dojos, which allow everyone to be trained in the gestures, and to do it as soon as the need arises.
[00:13:11] To be precise, the dojos are really production lines replicated inside the factory — exactly the same as the areas where people produce — so that operators train in conditions very close to reality. And when they manage to produce at an acceptable level of quality, they earn the right to move to the real production line.
[00:13:30] You also need what they call weak point management. We'll see the tech equivalents later, but basically it's a way to solve the most difficult and most recurrent problems. Change management, for its part, is the answer to the question: what do I need to know, as a team leader, for my team to keep functioning when there are disturbances or changes — people go on vacation, an upstream process changes, etc. What do I need to know, what do I need to pay attention to?
[00:14:01] Then there's the 2S, with its two conditions. First, an operator at his production line — who has to go fast, produce quality, etc. — must have everything he needs available in less than one second. Second, on the shop floor where he works, as soon as he has a question like "how do we do this, what's the standard for such and such a thing?" or "I need the visual management for something else", he must be able to access it in less than one minute.
[00:14:40] And lastly, they have rituals around quality. This is what's called the asaichi, a morning meeting — a daily stand-up to look at the quality problems. You can see they have a lot in front of them that they're going to look at and analyze.
[00:14:59] Sorry, I'm having a little trouble with the slides. So, what does this mean in tech? How are we going to look at quality and flow in tech? The first way to look at it is through the angle of measurement and objectives. If we made a direct transposition, we could say, "Well, we want to reduce the total number of bugs in the backlog to zero." But that forgets what happens every day, because in reality we don't want a backlog of bugs at all.
[00:15:28] We could also look at the number of defects per number of lines of code produced. Well, I think nobody here is very keen on using lines of code for productivity: there can be a lot of lines of code for very small things, configuration files, and even between two languages we won't get the same thing at all. So that metric isn't the most interesting, but we wanted to mention it, because a more complex one is to look at function points, as Woody mentioned earlier with IBM.
[00:15:56] That's about trying to work out, from the code, approximately how many functionalities we have. But to do that you have to look at quite complicated things; you can relate it to formulas, etc. We could almost say these are user stories, but it's not quite the same thing. You have a different practice.
[00:16:11] What we do, since we're a service company, is that my main metric as CTO is the number of bugs in production divided by the number of days I bill my clients: if I grow 10x, it's presumably normal that I get 10x the bugs.
[00:16:24] So it allows me to normalize the metric, we'll see what it gives later.
[00:16:29] If we look at what Accelerate says, they look at the change failure rate, taken the other way round: the percentage of deployments that fail. But that tells us little about whether we're really getting better, because the incidents — well, they don't describe them precisely in the book, but a priori these are production incidents, SRE-style. Where it gets interesting is perhaps to look instead at the number of defects per number of deployments, which is a metric that jumps around much less: the change failure rate is essentially boolean per deployment, so it tends to bounce, while the number of defects is much more stable. We could say that the base unit is the number of deployments we make.
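Here is a small sketch of the two normalizations just discussed — bugs per billed day on the service-company side, defects per deployment on the Accelerate side. The numbers are invented:

```python
def bugs_per_billed_day(production_bugs: int, billed_days: int) -> float:
    """Normalization by activity: growing 10x in billed days makes 10x bugs
    'expected', so it's the ratio that should trend down."""
    return production_bugs / billed_days

def defects_per_deployment(defects: int, deployments: int) -> float:
    """A smoother alternative to the change failure rate, which is essentially
    a boolean per deployment and therefore jumps around a lot."""
    return defects / deployments

# Invented example: same absolute bug count, very different normalized picture.
print(bugs_per_billed_day(production_bugs=30, billed_days=600))   # 0.05
print(bugs_per_billed_day(production_bugs=30, billed_days=6000))  # 0.005
print(defects_per_deployment(defects=12, deployments=400))        # 0.03
```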
[00:17:11] What do A, B, C and D look like in our case? In Dantotsu, the A/B/C/D classification is actually rebuilt each time; there's no fixed method to find them. The only logic behind it is to ask: how serious are these things? So in dev, for a fictional team developing an app, with a flow that has continuous integration and continuous deployment, a QA team that checks things, and that serves internal users and customers, we could define types A, B, C, D this way — but we're not obliged to. The first type of defect is the one caught directly on the developer's machine — that's a bit the extreme programming spirit — or caught in the CI, or the CD if we have tests there. The QA team checks further downstream, so those will rather be type B bugs — type B defects, sorry — because there will be more friction to understand and report them. Then, once the code is deployed, the closest users are the internal users: internal operational teams, support, etc. I mentioned the banking advisor earlier; those are type C defects, because it will be even harder for them to express what the problem actually is, and for us to quickly understand what's happening. And the most complicated — in any case the most disastrous, as you said very well earlier — are the type D defects, for the end customers who find themselves with an application that at worst crashes, and at best has performance issues or is buggy.
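One possible way to encode that mapping for the fictional team described above — the stage names are ours, and as the speakers say, every team has to rebuild this classification for its own flow:

```python
from enum import Enum

class Stage(str, Enum):
    DEV_MACHINE = "developer machine"
    CI_CD = "continuous integration / deployment"
    QA = "QA team"
    INTERNAL_USERS = "internal users (support, operations)"
    END_CUSTOMERS = "end customers"

# One possible mapping for the fictional flow described in the talk.
STAGE_TO_TYPE = {
    Stage.DEV_MACHINE: "A",
    Stage.CI_CD: "A",
    Stage.QA: "B",
    Stage.INTERNAL_USERS: "C",
    Stage.END_CUSTOMERS: "D",
}

def classify(detected_at: Stage) -> str:
    """Return the defect type for the stage where the defect was detected."""
    return STAGE_TO_TYPE[detected_at]

assert classify(Stage.QA) == "B"
assert classify(Stage.END_CUSTOMERS) == "D"
```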
[00:18:55] And so, when we draw a parallel with the eight-step procedure I mentioned earlier, what I observe in the practices of most teams is that the steps are a little different. Generally, a bug is reported by various people. The team — generally the product owner — qualifies the priority of this defect: should it be corrected immediately, and should longer-term measures be taken? We prioritize the fix, we fix it.
[00:19:24] And sometimes we do a post-mortem, when it's really serious and we feel we absolutely have to explain to the CEO why it doesn't work. Generally the post-mortem leads to countermeasures that are more about inspection: we add tests, we add code review. And it takes time to do, indeed, because — I tried to do the same temporal mapping — the first three steps are often done at the moment the bug is reported. On the other hand, the fix, depending on how much of a priority it was judged to be, can take several weeks; the post-mortem even longer; and the post-mortem countermeasures, in general, we plan them for Q3. So we don't have at all the same intensity in the desire to eliminate the root causes of each defect, nor the same speed.
[00:19:43] To continue the parallels, I've taken the different photos we saw earlier to see how they translate for us.

The visual management of quality is generally much more limited: it's often ticketing tools that are used for that, and we see a list of defects. And it's very stock-oriented, meaning we ask ourselves how to have fewer open bugs. We far less often ask ourselves how to introduce fewer bugs into the products, whereas that is really the intent of Dantotsu.

When we look at the standards — here we have numbered sleeves so that at each workstation there are clear standards on how to do the job — that's often less the case in tech. We observe it in code bases where there can be lava flow, for those who know the antipattern: an area where a newcomer sees that there are three ways to do the same thing and doesn't know which one to build on. Generally he doesn't pick the most up-to-date one, and that creates code bases in difficulty.

On the dojos, I find we are actually more advanced, in the sense that extreme programming has brought a lot of practices that let us train and practice regularly, whether it's katas, pair programming or code review. What I observe, however, is that it's often less just-in-time: the idea of the dojo is also that the person, just before they need to perform a gesture, gets trained in that gesture. With us, the katas often run in parallel with production: we tell ourselves it's fine for the person to work and code features, and if she has time to do katas, she can. These are generalities, but that's what I usually observe.

Regarding weak point management — this desire to eliminate deep-rooted bugs — I find we are also quite well equipped. There are tools like Sentry or APMs that let us really understand difficult and recurring problems in depth. I'm thinking, for example, of one of the weak points I see in bug analyses: N+1 queries, database queries made in an excessive way. They cause memory problems, and we're quite well equipped to dig into these weak points, which is good.

And on the 2S side — how we create an environment that each developer masters — there are also good practices: practices that avoid dead code, that make it easy for a developer to know where to place his code, developer portals and systems for managing documentation. However, it's rarely measured as finely as described in Dantotsu, this idea that it's only OK if it takes less than a second, or less than a minute. Often we don't go that far: we won't ask ourselves whether it's acceptable that it takes 5 minutes for someone to figure out they have to go into this code folder to write a line — we just don't measure it.
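Since N+1 queries are cited above as a typical weak point, here is a self-contained sketch of what one looks like and how a single join removes it. The schema is invented and unrelated to the speakers' code bases:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, client TEXT);
    CREATE TABLE photos (id INTEGER PRIMARY KEY, order_id INTEGER, url TEXT);
    INSERT INTO orders VALUES (1, 'client-a'), (2, 'client-b');
    INSERT INTO photos VALUES (1, 1, 'p1.jpg'), (2, 1, 'p2.jpg'), (3, 2, 'p3.jpg');
""")

# N+1 weak point: one query for the orders, then one query *per order* for its photos.
def photos_by_order_n_plus_one():
    result = {}
    for (order_id,) in conn.execute("SELECT id FROM orders"):
        result[order_id] = [url for (url,) in conn.execute(
            "SELECT url FROM photos WHERE order_id = ?", (order_id,))]
    return result

# Countermeasure: a single JOIN fetches everything in one round trip.
def photos_by_order_joined():
    result = {}
    rows = conn.execute(
        "SELECT o.id, p.url FROM orders o LEFT JOIN photos p ON p.order_id = o.id")
    for order_id, url in rows:
        result.setdefault(order_id, [])
        if url is not None:
            result[order_id].append(url)
    return result

assert photos_by_order_n_plus_one() == photos_by_order_joined()
```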
And the rituals: I mentioned daily stand-ups earlier. Generally, daily stand-ups are more focused on delivery — did I manage to deliver, typically in a Scrum model,
[00:22:55] the complexity points I was supposed to deliver the day before. It's less quality-oriented, and it's more oriented around the problems the team anticipates for the day ahead. Whereas in the asaichi, I present to the team the problems I solved the day before, to share the learning. So the intention is a bit different in the two.
[00:23:13] But enough theory. Because I think it's also interesting to have feedback on what it really gives. So I'll let you take over.
[00:23:21] Thank you.
[00:23:26] So we're going to talk about what we tested at Sipios and Ocus. I need to figure out how this works — I need to aim the thing, right? Yeah. So, that's the Ocus logo. What does Ocus do? Ocus actually has several products, but we're mainly going to talk about the marketplace, which is the biggest one. Ocus delivers photos, and they deliver photos at scale. What does that mean? There are clients — these are real clients — like Uber Eats, Smartbox or Nexity, who need photos: Uber Eats of food, Smartbox of travel destinations, and Nexity of apartments. And so they say: okay, we need photos of this apartment, at this location, and the point of contact on site is this person.
[00:24:18] What the client wants is to receive, as quickly as possible — they have SLAs, generally 5 to 7 days — photos of the place they specified, with good quality. So they give a certain number of guidelines, which vary quite a bit: say I'm Nexity, I want something well lit, I want it to be clean, no people in the photo, etc.
[00:24:42] From the inside, what does it look like? Simplified: the client
[00:24:47] will, once connected, call the web API, or else use a small front-end they have for that.
[00:24:54] Inside, they can then order photos. There's a whole flow inside: a back-office application where the production team tries to find a photographer — sometimes it's automatic. The photographer has an app to get all the information he needs about where he has to go to take the photos. The photographer then submits the photos to the platform, and they go to a small operations team which looks at the quality of the photo, or of the group of photos, and accepts or refuses them. They send them to a service provider who does the retouching. Once the retouching is finished, it comes back through the API and is returned to the client, all of that as quickly as possible. One of the main problems — because there were many, everywhere — is that the operations team is at the end of the flow. So there are a lot of lead-time problems, because as soon as it takes more than a day or two, that's it, we've broken the SLA. And 35% of their workload could be linked to bugs — I went to check, and it was indeed the case. Ocus is a company of 50-80 people, with an engineering team of 25 people across product and tech, to give you an order of magnitude. When I arrived, in mid-May, the co-founder asked me to look at quality and to put in place a quality culture — it wasn't very well defined. So I looked at things, and the first challenge was to reduce the bug stock. From now on, when I say bug, I mean a defect of type C or D: reported either by internal teams or by clients, who generally go through the internal teams to report them — there are many possible paths, but it's C or D in the terms we used earlier. So the first challenge was reducing the stock. When I arrived there was a stock of 50-300 bugs; here, in dark red, you see the new ones per day. It fluctuates because they solve bugs every day.
[00:27:09] I mentioned the frustration of the other teams who saw their bugs not progressing. There was a pretty significant and very variable lead time: some bugs didn't take much time and were closed pretty quickly — generally because they weren't actually bugs, we'll come back to that — and others were never closed and had been sitting there for 100 days. When we looked at them and asked "is this still relevant?", the answer was generally yes. The method used until then relied on two things. The first is that bugs were reported via a form wired into GitLab, which is good — it worked very well. The second is that the product owner is at the center of the machine: the product owner decides what the team does — decided, anyway, in that model — on a daily basis. And so they weighed the bugs against the features and said, well, the features are important, so we're not going to focus on this bug, we'll let it drag on. And generally it ended up dragging on until the reporter started complaining and saying: come on, that really bothers me a lot, couldn't you take a look?
[00:28:21] Come on.
[00:28:25] There you go. So the second thing was being too much in quick-fix mode. We saw that with their process: it's not necessarily the tech lead — who is a bit the equivalent of the team leader — who looks at the bugs; it's each developer, because each developer can do it, and there's no particular dogmatism about that. When we looked at how a bug was handled — this is an example of a bug that actually happened — there was an input image taken by a photographer, of food (that's a fake image, but it looked like that). It goes through the API process, and the goal is to generate thumbnails so that the client can pre-select things; we also deliver the thumbnails to the client. And the thumbnail below came out washed out, whereas it should have been the thumbnail of the first image. What happens behind the scenes? When the developer looks, he realizes the input photo didn't have the "correct" color profile, in quotation marks. It was a somewhat odd color profile — we looked at it together afterwards. At the time (I arrived later; I dug into this particular example, which happened before), he answered this — it's a real quote, I copy-pasted what he wrote, I took a screenshot: after investigation, we saw that the problem came from the edited versions; you use the sRGB color profile, but in fact you have to use the sRGB IEC, etc. profile. So of course the client knew that, and was just being difficult... Well, it's not exactly the client, it's the external service provider — and the provider does know the trade a bit, but from there to saying "I use sRGB, whatever"... And so it ends with: thank you for your time, thank you for taking the time to fill out the form, please send us better stuff, thank you, goodbye. And we said to ourselves: we can still do better than that. I think people in the room have already seen bugs resolved like that, where the developer replies something like: stop sending us crap and it'll get better.
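The root cause above being an unexpected ICC profile, a small gate at ingestion could have turned this type C defect into a type A. This is only a sketch using Pillow, and it assumes the platform is happy to normalize everything to sRGB:

```python
import io
from PIL import Image, ImageCms  # pip install Pillow

def normalize_to_srgb(path: str) -> Image.Image:
    """Detect images whose embedded ICC profile is not sRGB and convert them.
    Sketch only: error handling and the exact acceptance rule are a platform choice."""
    img = Image.open(path)
    icc_bytes = img.info.get("icc_profile")
    if not icc_bytes:
        # No embedded profile: assume sRGB (a policy choice, not a universal truth).
        return img
    profile = ImageCms.ImageCmsProfile(io.BytesIO(icc_bytes))
    description = profile.profile.profile_description or ""
    if "sRGB" in description:
        return img
    # Unexpected profile (the root cause in the bug above): convert instead of
    # silently producing washed-out thumbnails downstream.
    srgb = ImageCms.createProfile("sRGB")
    return ImageCms.profileToProfile(img, profile, srgb, outputMode="RGB")
```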
[00:30:35] Sorry for sensitive ears. So, the first thing: we started on June 1st.
[00:30:45] I met with everyone and proposed a new way of managing bugs — this is literally the document I wrote: that quality is important, more than anything; that we want to reduce the stock to zero; and how we're going to do it, with the tech leads consequently carrying it. Ah yes, I also said that I would spend 40% of my time
[00:31:10] doing it with them, which I did until it was rolling, so that took 3-4 months. We kept the reporting flow — it worked very well — and we slightly reworked the form a bit later, but that came from the teams. The second thing we changed is the intention.
[00:31:28] So a reported bug is now handled directly by the tech lead — or by the team, for things where the tech lead doesn't necessarily need to intervene. And the tech leads, since we generally have a front team and a back team, have to talk to each other very quickly on a lot of bugs to be able to act. A lot of bugs were lingering simply because they hadn't talked to each other.
[00:31:52] Secondly, we created a quality team with all the tech leads, plus the engineering managers and the individual contributors in the company who were interested in quality — the driving people who were attracted by this, I brought them in — and I also take part, my goal being to get it off the ground, because an interim CTO mission has an end and I wanted it to last. So the goal was to train the engineering managers to take this over for the future. The third thing is to run weekly sessions around the QRQCs — the equivalent of what they did in the asaichi. I did it weekly; daily was too complicated for now, and it's better to start somewhere. The sessions are also there to share how to solve problems. They're done with the technical team, the product team and all the team leaders; team leaders from other Ocus teams are also invited, and some come because it interests them — and they keep coming. The two co-founders are there too, and it's not 50% one, 50% the other: generally both of them are there 80% of the time.
[00:33:07] at that meeting. And everyone looks at examples, so let's see one now. These are screenshots — we record the sessions, because almost the entire technical team at Ocus is remote. This is Remy, an individual contributor, presenting how he looked at network requests — the number of bytes of the network requests — to try to debug a file-upload problem on AWS. This is Olivier, a front-end team leader, presenting different things he tried on some cases. This is, uh,
[00:33:45] I was telling myself that I would have a stupid memory lapse on one of the people, on one of the techs.
You wrote it down.
[00:33:51] Hm? Ah yes, it's Edwin, sorry. Edwin is not part of the technical team; he's a team leader of the operations team, and he's interested in a lot of the problems there. There he was discussing connection problems with the photo editor. And then Lionel arrives. Ah.
[00:34:15] And Lionel, who was presenting a diagram — he was doing weak point management, looking at all the potential problems with that same photo editor. On the right it's me smiling beatifically, next to the co-founder of Ocus, Julien.
[00:34:34] Let's look at one of the QRQCs, done by Remi — another Remi. Remi is interested in the performance of the API. They set performance criteria on some endpoints, but they also had an API endpoint that was timing out. The problem is that it's the invoicing endpoint used to pay the photographers at the end of the month, and this endpoint kept breaking. So it's a type C bug, one that got caught
[00:35:08] in production, and that we're going to try to solve.
[00:35:12] When it comes back... it's very slow. I'm going to go through this pretty quickly because I can see the time. So, he quickly evaluated the business impact. He also looks at what the bug looks like exactly: he shows us an interesting graph below, where the x-axis is the number of requests and the color is red or orange depending on whether the network request took too long. Orange, for example, is requests that took more than 6 seconds — you can imagine the user on the other end.
[00:35:39] And then he explains what he did to fix it — with code; we're not afraid to show code in these sessions. And at the end he shows what his correction produced: the graph at the bottom left, which goes orange, orange, orange, then a big yellow peak. What does that mean? The far right of the graph is the end of the month; the peak is the number of requests to the invoicing endpoint, and they are yellow because he solved the problem and they all take less than half a second.
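The half-second figure comes from the example above; here is a minimal sketch of the kind of percentile check that could back the alerting discussed next in the analysis. The function names, the sample and the way it would be wired to an alert are assumptions:

```python
from statistics import quantiles

def p95_seconds(durations: list) -> float:
    """95th percentile of request durations, in seconds."""
    return quantiles(durations, n=100)[94]

def check_endpoint_latency(durations: list, budget_s: float = 0.5) -> bool:
    """Return True if the endpoint stays within budget; wire the False branch
    to whatever alerting the team already uses."""
    return p95_seconds(durations) <= budget_s

# Invented sample: 10% of requests take more than 6 seconds, like in the graph above.
sample = [0.12] * 90 + [6.2] * 10
print(check_endpoint_latency(sample))  # False: the slow tail blows the 0.5 s budget
```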
[00:36:12] Then he also explains the whole reasoning of what he did to investigate the problem. He tried to see how the issue was introduced and which developer did it — it was too old, so he didn't dig much. He also looked at what metrics we could track and what alerting we could set up. He also saw that there were other problems — you talked earlier about N+1 queries in SQL — and algorithm optimization problems. And precisely at that meeting, the co-founder and CEO asked: have you considered training people? Because there are things here about algorithms. He said, well, I hadn't thought about it, and two days later he came back to me saying: I'm going to run a training for everyone. That's also what these sessions are for: we don't need to do perfect things, we especially need to make sure people think about all the interesting angles. And so we've changed the way we work: each innovation that each person brings — on training (the human angle), on methods (how we change our way of working), on machines (how our machines are configured), on materials (the input information) — all of that we work on continuously. With 35 QRQCs, in four months we changed the way we worked quite radically. I'll go through this quickly: first we tried to identify the bugs we could close quickly, and we very quickly hit a wall — the team was taken up all of August by a very important project, so things started to slip back, and there were vacations on top of that. We got back on track in September to catch everything up, and now we've gone from 50 to 12.
[00:38:03] 60 to 12.
[00:38:07] There are several difficulties. The first is doing it end-to-end, all the way, so that each developer really understands the problem of the person on the other side and really solves it. The second is that it's really the people who matter most: there were cases where developers were discussing through GitLab comments and so on, and it took a crazy amount of time — so they had to come out of their shell a bit. And then keeping up the rhythm: supporting the engineering manager, supporting the effort to solve bugs, and always trying to be faster at solving them.
[00:38:49] And it shows results: we have fewer and fewer bugs coming in. This is per quarter — Q4 isn't finished, but pro rata we already have half as many. And we are more and more able to fix them within one working day. That's just the fix, not the complete 8-step loop, but at least the bug is no longer impacting anyone. And afterwards we also try to improve the overall resolution.
[00:39:17] I'll pass you the mic.
[00:39:18] Thank you. I'm going to share my own experience — it's interesting because we were able to run our experiments in parallel, and we didn't do exactly the same thing. So, Sipios: we are a service company, so we support clients, and the pioneering client on which I tested Dantotsu is the digital division of BPI France. In 2019 we had one team — they're all at this table; I wasn't at that meal, by the way, too bad.
[00:39:45] And three years later, we had scaled pretty rapidly: around 100 people work on their digital today — developers, POs, all kinds of profiles. I think I'm in this one; yes, I'm at the bottom left. This scaling happened because the client was very happy with what we were doing. We had great successes — we launched the state-guaranteed loan in 5 days — and we have very good NPS on the applications we develop with them: the end users are pretty happy with what we do, it's an NPS of 67. We were also able to experiment with a lot of things — I think there have been talks here about Conway's law — we decoupled the architecture, we did team topologies, a lot of cool stuff.

On the other hand, the scale caught up with us. You can see, over the last year, production bugs on this client increased significantly, until it became a priority problem from the client's point of view. Here I put a quote that I translated into English — you don't want to get that quote from your client; maybe you've already received similar ones. It was starting to harm the reputation of BPI France's digital arm with the business lines, with the users, etc. So we were becoming a real problem, and I decided to react.

It happened a bit by chance: I stumbled upon a video by Michael Ballé, whom you may know if you know Lean, who was talking about Lean in the quality function — what the role of the QA function in a company is. And there was this little book at the bottom right that spoke to me: the radical quality approach. So I bought it and read it in a somewhat unusual place, because I was on a trek in Morocco at the time — I think I'm the only person to have read a Lean book in the Atlas. I didn't really know what a trek was, which made it all the harder; I thought there would be more camels and cars. Anyway, I came back from vacation super excited about implementing all the things I'd read in this book that talks about oil leaks, cars, very different things, asking myself how it applies to us.

So I started with physical visual management. You can see the four sheets at the top, which let the teams record their A, B, C, D bugs, and the sheets at the bottom, which let the teams write fairly synthetic analyses — we even printed the piece of code, and the team explained what cause had let the bug appear and what cause had made us detect it so late.
[00:42:15] I had an example of a QRQC, but since we don't have much time and Flavien has already described one, I'll skip it. In theory we had a fairly complete format: the problem, the impact of the defect, the user story that had introduced the bug, the piece of code that had introduced it, and the steps — particularly review and testing — that had let it slip through. And then we listed the errors that had led to this bug appearing, to deduce root causes. So on the left are the reasons why the bug appeared — for example, we didn't know how to correctly store and exchange amounts: we had a supplier who sent us an amount as a character string, in hundreds of millions of euros, and that caused a bug. And on the right, what caused us to detect it so late — there we're more interested in testing, review, etc.
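To illustrate the technical gesture behind that "amounts as character strings" root cause — parse the supplier's string once at the boundary and store an exact integer number of cents — here is a sketch; the accepted input formats are an assumption:

```python
from decimal import Decimal, InvalidOperation

def parse_amount_to_cents(raw: str) -> int:
    """Parse a supplier-provided amount (e.g. '123456789.50') into integer cents.
    Decimal avoids float rounding; storing cents removes ambiguity downstream.
    The accepted formats (spaces as thousand separators, comma or dot decimals)
    are assumptions for the example."""
    try:
        amount = Decimal(raw.replace(" ", "").replace(",", "."))
    except InvalidOperation as exc:
        raise ValueError(f"unparseable amount: {raw!r}") from exc
    return int((amount * 100).quantize(Decimal("1")))

assert parse_amount_to_cents("123456789.50") == 12345678950
assert parse_amount_to_cents("123 456 789,50") == 12345678950
```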
[00:43:11] It had positive effects: it really created a much stronger quality culture in the company, and together we built a system that lets us monitor bugs at the company level. In the discussions I had with the teams, I saw that we were having much more in-depth conversations about quality and craft: rather than saying "I didn't pay attention", it was "I mis-applied one of Martin Fowler's refactorings" — and they'd give me the name of the refactoring. So, much more in-depth discussions.
[00:43:35] But I still had problems. The physical visual management approach was not well suited to teams that were partially remote. I realized it was harder to do with teams that had significant stock problems, as you mentioned earlier in your case at Ocus. I realized that some bugs had been introduced two years earlier and the dev who wrote them was no longer there, so trying to understand why he had made the error wasn't very interesting. I also saw that it took time: the analysis I showed you earlier takes a good 2 hours even for someone who knows how to do it, so doing it for every bug isn't compatible with a tech lead's or a dev's schedule.
[00:44:11] And then it often drifted into subjects that weren't really about technical mastery: "we're pressured by delivery", or "it's the other team that did it wrong" — so we were very much on collaboration or working-condition topics. Those obviously shouldn't be ignored, but I also wanted to generate technical learning through this activity. And I was a limiting factor myself, because I couldn't spend enough time with all the teams, and I realized I couldn't keep up. So we designed a new system as a countermeasure to these problems. The first thing is a much lighter format that fits on one slide: the description of the defect from the user's point of view, the defective code, one cause of introduction of the defect — a cause that I want to be about a technical gesture; here it was a very specific serialization issue when converting Java to Kotlin — and one cause of non-detection, with local countermeasures for the team.
[00:45:14] I also took a pilot project with Thibault, the tech lead of a team with whom we did QRQCs every day at 6pm, notably with the group's CTO and myself. It motivated him so much that he gave a talk at Human Talks along the lines of today's, but at his team's level. And it had a very strong impact: these are the production bugs — the detected bugs — and in 3 months we got an 80% reduction in the bugs introduced by the team. So it really had very strong effects on this pilot team.
[00:45:44] At Sipios as a whole, however — there was a talk earlier about what is noise versus what is a real cause — I think we're still more at the noise stage. What I'm aiming for is the 88% reduction in 3 years. I got there in September, but I can see that July was very complicated, and that there's a lot of noise, as was said in the earlier talk — notably in how rigorously each team records all its bugs. I notice that when a team launches a bug project, there's much more acceptance testing, much more QA, much more rigor, and that strongly affects my numbers. So I realize I'm going to have to do what I did with this pilot team, team by team, for it to scale.
[00:46:24] One last point that was still very useful: as a CTO, it let me see what the typical problems, the real weak points of the teams, were. So we identified eight gestures — we'll allow ourselves a maximum of 10 — to build a real training academy with a partner company, whose founder, Regis, gave a talk yesterday: what are the key gestures a developer must master in order not to introduce bugs? We don't want more than 10; we want them to really be the fundamentals. On these gestures we have created standards, courses and even simulators, where a person can train to review refactorings, for example, and say whether it's a good refactoring or not, using Martin Fowler's checkpoints. So it allows us to invest well in training.
[00:47:07] Conclusion.
[00:47:09] Conclusion. So the goal of the conclusion is for you to leave with
[00:47:14] three or four key points on which we agreed — our experiences are quite diverse, we didn't do the same things at all, and that's why I think we're very complementary. The first thing is top-management buy-in: top management showing interest, coming to genuinely look, asking why, maybe challenging the developers a bit too. That also creates a link between the co-founders and the developers, which, with scale, especially in startup contexts, tends to disappear. The second thing is problem-solving training. Problem-solving is very hard: just doing a QRQC in 2 hours — few people manage it, because you have to think about everything. You have to have practiced, really thought about all these little things, and done the detective work. So it takes a lot of coaching, practice, and sharing with other teams.
[00:48:18] And two ideas that became clear to me: focus on one technical learning rather than sweeping through a whole tree of causes, because each bug generally has at least 10 different causes that show up.
[00:48:30] So focus on the one that lets the developer progress technically, because I've noticed it creates much more engagement when they start with bugs where they learn things about their craft.
[00:48:42] And the last point, which is important and which is really the starting point of Dantotsu, is to actually measure. It was complicated — it took me several months to get a bug measurement at Sipios, it takes time. You have to do it, and you have to set objectives, to create energy and momentum around these Dantotsu projects. That's really why it's the first step.
[00:49:02] We've put here, as a summary, our QRQC templates, which are a bit different, and we'll be happy to share them with you — this is Ocus's here. I'll make the slides available, and you can also ask us for them.
[00:49:13] They're in our Notions, for both of us.
[00:49:15] Exactly, it's Notion for both — and there's mine, which I showed earlier — so you can test the formats that suit you and perhaps adapt them to your context.
[00:49:26] That's it for our talk. Thank you very much to the organizers, to the sponsors and to the public.
[00:49:32] Yes, thank you to the public.