Alain Hélaïli
Transcript (Translated)
Hello everyone, my name is Alain Hélaïli; as you may have seen and as it's written, I work at GitHub. And today, while trying to make up a bit of the time we've lost, I'm going to tell you how we operate, given the few particularities we have. Namely, we are a company that has now existed for 9 years. We are now a bit over 730 employees; it grows a little every day, so this slide is from a few weeks ago. And we are 730, and as you can see, we are very, very distributed across the planet. And this creates quite a bit of complexity in our organization and in our way of working.
In fact, do you speak French a little?
Yes? You're not a French speaker? No? So if you understand French, I'll keep going in French; otherwise, I could switch. Okay. So we have quite a few people all over the planet, and many of them are engineers, people who produce code. So we had to find together an effective way of working, one that still allows us, as a company that works mainly on the web and in SaaS mode, to deploy very, very frequently. When I say very frequently: here you have the number of deployments to production on GitHub.com over one week. On average, that's about 80 to 90 deployments to production per day. GitHub.com, for those who are less familiar, has about 25 million registered users and 60 million unique visitors per month.
Right now, with Black Friday, we've dropped a bit because it's mostly the e-commerce sites that are climbing, but we sit somewhere between the 50th and 60th most visited websites in the world. For a developer site, that's still pretty good. And you can imagine that the infrastructure to manage that behind the scenes is quite substantial: we're talking about several thousand servers in production for GitHub.com. So there are these two aspects: deploying very quickly, deploying continuously — and we'll see why we want to deploy continuously — and all of this with a completely decentralized organization, distributed across the planet. So how do we operate so that this software creation machine works well?

The first thing, unsurprisingly, is that we use GitHub a lot. But we don't just use it for development; we use it for many things that aren't necessarily related to development. At GitHub, the legal team uses GitHub, the sales team uses GitHub, the marketing team uses GitHub — everyone uses GitHub. The first thing we did was stop sending emails. So at GitHub, if you want to talk to someone, if you want to communicate with one of your colleagues in any division, you don't send them an email; you go into a repo and create an issue. Our means of communication is the issue. There are many advantages. It means, by the way, that marketing has a repository, the legal department has its repository, every team has its repository; for some it's code, but for others it's not code — it's markdown, files, text, GitHub-style, let's say. So when I want to talk to a lawyer, for example, I go into the lawyers' repo and create an issue there.

The advantage is that it's indexed by GitHub and visible to everyone, so we have open communication. And an issue is somewhat like an email, but better: like an email, I have a title; like an email, I have content; but unlike an email, I have metadata. I can put labels on it, I can say what type of discussion it is, and I can say who is responsible for the discussion. There is the concept of the DRI (Directly Responsible Individual): I can say that these people are responsible for managing this discussion, and potentially for summarizing it at the end, synthesizing all the opinions, and reaching a decision. I can set milestones, I can see the list of participants. I am sure that this discussion is the only discussion that exists. With an email, it can go in all directions: I can forward it to two people who themselves forward it to three other people, who don't reply to everyone but have their own sub-list, and so on. So it can go in all directions, and I'm never sure I have the complete view of all the exchanges that took place. Moreover, when you mention an issue in another issue, a link is automatically created, and I can navigate step by step like that to see the topics related to this discussion. And on top of that, someone can lock the conversation at the end by saying, 'It's finished, we don't talk about it anymore, we've reached the end of the topic, and we move on to the next step.' So that's super interesting for us. And one of the values of this, if you're not yet convinced by the approach, is that I can be a new employee at GitHub and access all the conversations we've had in the 9 years of the company's existence.
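To make this concrete, here is a minimal sketch of what 'an issue instead of an email' can look like when scripted against the GitHub API with the Octokit Ruby gem. The repository name, labels, and assignee are invented for the example; in practice most issues are simply opened in the web UI.

```ruby
require "octokit"

# Authenticate with a personal access token read from the environment.
client = Octokit::Client.new(access_token: ENV["GITHUB_TOKEN"])

# Instead of an email: an issue with a title, a body, labels and a DRI.
issue = client.create_issue(
  "acme/legal",                              # hypothetical repo of the legal team
  "Review the NDA template for the new partner",
  "Context, links and the actual question go here, in markdown.",
  labels:    ["contract", "needs-decision"], # metadata an email does not have
  assignees: ["the-dri-for-this-topic"]      # the Directly Responsible Individual
)

puts issue.html_url  # the discussion now has a URL anyone can be pointed to
```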
When I arrive as a new employee at GitHub, I'm not in a tunnel where I have absolutely no view of the company's history. I will be able to understand how decisions were made in the past, which opinions changed, and what the situation was at the time. And I will have better empathy, let's say, toward the colleagues who made a given decision at the time, in a context I didn't experience and which can be completely different from today's. Sometimes decisions can seem bad today, whereas on the day they were made, they were rather good. So that's one of the fundamental elements, and it's really the cornerstone of our system. Because being completely decentralized at GitHub, we can't rely on oral culture to make decisions. When I have an engineer in Australia on the same team as two other people in Scotland and another on the west coast of the United States, oral culture suddenly isn't effective. We can't get all these people on a call. They obviously can't meet. We can't do a scrum meeting; it's not possible. Even a call isn't possible, since they are in completely different time zones. So this completely written part is super important. And at GitHub, we've even come to say that if a discussion isn't written down somewhere, it never happened. If a discussion doesn't have a URL that I can share with someone, it never happened. And obviously, an issue is the place for our discussions, and it necessarily has a URL that I can share with someone so they can understand what happened. And again, it's archived; it stays there forever. So that's super important for us. Obviously, these discussions can be organized through our projects in GitHub, with boards that allow us to prioritize and track the progress of things.
After that, when we work on code, we also have this tool that is super interesting for us and hopefully for you too: the concept of a pull request. The pull request, for those who are less familiar with GitHub, is a discussion, a bit like an issue, that we attach to files we are modifying. Traditionally, these files are code, but it can also be markdown. So again, our lawyers, when they work on a document, on a contract, on things like that, will potentially do it through markdown, and then they will discuss this file through the pull request. They will see the different revisions; they will see how it evolves. You could do this with a Google Doc or similar tools, but here you really have the full history of everything that happened: who modified which line, when, why, and so on. So it's traceability that goes much further than what you can have in Google Docs. However, it requires a slightly higher level of training than a Google Doc. It's not immediately understandable by anyone, but all our lawyers have always managed it, all our new salespeople have always managed it — the entry barrier isn't that high in the end. So it works well in the non-development context, let's say. And in the development context, it's also super interesting. For those who are less familiar with this system, the idea is to have discussions around the code we are creating: discussions as early as possible, and discussions at the end on the finished product. I insist on the 'as early as possible' part. Very often, people open a pull request only at the end, once the code has been written, and then the developers do what we call a code review: they read each other's code and say what you wrote is good, it's not good, you need to change this, and so on. We encourage opening the pull request very early instead, to discuss with the rest of the team what we are going to do, how we are going to do it, and what the impacts will be. By sharing very early what we are going to do, we can reach a consensus on the roadmap of what we will develop. So for a particular feature, I can say: here's how I'm going to do it, here's what I'm going to modify — do you agree with me? If people agree, I can start working. And afterward, when my colleagues review my code, they will know exactly what we are talking about. They know what to expect; they are not surprised. So they can give me feedback that is positive and constructive, and we won't waste our time in ideological debates about coding concepts, because we agreed from the start on the direction we were going to take.
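As a sketch only — again with the Octokit gem, and with invented repository and branch names — opening the pull request early is a single call; the point is that the discussion starts before the code is written, not after:

```ruby
require "octokit"

client = Octokit::Client.new(access_token: ENV["GITHUB_TOKEN"])

# Open the pull request as soon as the branch exists, not once the work is done,
# so the team can agree on the approach before reviewing the finished code.
pr = client.create_pull_request(
  "acme/webapp",             # hypothetical repository
  "master",                  # base branch
  "improve-search-ranking",  # freshly pushed feature branch
  "WIP: improve search ranking",
  "Here's how I plan to do it and what it will impact. Do you agree?"
)

puts pr.html_url
```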
And this code review process, if it's not something common in your teams, I really encourage you to implement it. Code review is what allows us to catch the maximum number of bugs, but it's also something that allows people to learn. And this notion of learning about code and development techniques is very interesting. As a reviewer, by reading someone's code, I will learn: I will discover their programming techniques, their programming style — it's a bit like an author's style — and I will discover new things. And then I will also teach others with my own ideas, my ways of doing things, and so on. So it's really a bidirectional exchange. Very often, teams say: we'll take our best developer, and they will review everyone's code and spread their own practices. By doing this, we deprive ourselves of a great wealth of exchanges between different people and of mutual sharing of information. And again, a junior can review a senior's code, because they learn a tremendous amount from it.
There is the case, for example, of Norauto in the north of France. They had decided to move to new languages, like Scala. Only a very few developers knew Scala, and they forced all the other developers to review each other's Scala code. They didn't understand anything at first, but by reading it repeatedly, they began to understand not only the language but also the philosophy behind it, and afterward they became skilled Scala developers. So indeed, by forcing reviews like that, they gradually understood what was happening in the code and were able to improve their skills. Also, by distributing reviews like this within the team, we reduce what is called the truck factor. Does the truck factor mean anything to you? The truck factor is the impact that one of your colleagues being hit by a truck would have on your project. So the higher my truck factor, the more resistant I am, let's say, to losing one of my colleagues on the project, because I have spread the knowledge of my code across my entire team, and therefore I'm more resilient. So that's the pull request; it's very interesting, and again, you can use it not just for code but also for markdown, text, and things like that — it's very useful. And then, I'll go over this quickly: we have plenty of tools that give us information, that look at our code and tell us what's happening. This is also on the pull request: we have about twenty jobs that give us feedback on the state of our code every time we push. So a developer knows, for example, that in the first check — the lint step, for instance — there is an error in their code and they need to review it. And you see that here, we use continuous integration, we use these concepts. Very frequently, continuous integration is a single line that tells you everything went well, or something didn't go well, and then it's up to the developer to dig into the system to find out why something — they don't yet know what — didn't go well. Here, the idea is to give them as much information as possible, so that ideally, continuous integration is something they know exists but don't need to worry about: what happens inside, how it works. We give them higher-level information and put it right in front of their eyes, so they don't have to go looking for it. The information is pre-chewed and comes directly to them. Additionally, you see that some checks are mandatory while others are optional, so they also know the minimum quality level they are required to meet. And in addition, afterward, we also feed back to them information on all the deployments that have taken place. So that's for the plumbing.
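To illustrate how those twenty-odd jobs mentioned above can surface their verdicts on the pull request, here is a hedged sketch using the commit status API through Octokit. The context names, URLs, and SHA are invented, and whether a given check is mandatory or optional is configured in the repository's branch protection settings, not in these calls.

```ruby
require "octokit"

client = Octokit::Client.new(access_token: ENV["GITHUB_TOKEN"])
repo   = "acme/webapp"   # hypothetical repository
sha    = "abc123def456"  # commit being checked (placeholder)

# Each job reports its own named check ("context"), so the developer sees
# exactly which step failed instead of a single green or red light.
client.create_status(repo, sha, "failure",
  context:     "ci/lint",
  description: "2 style violations found",
  target_url:  "https://ci.example.com/jobs/lint/42")

client.create_status(repo, sha, "success",
  context:     "ci/unit-tests",
  description: "1,284 tests passed")
```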
And if I summarize this, it's what we call GitHub Flow at our place. The idea is that a developer creates what we call a feature branch. There is a lot of debate around the concept of feature branches: some have feature branches that last for months. A feature branch, at our place, lasts between a few minutes and a few days, or sometimes a few weeks, but never much longer than that. So they create a feature branch, add commits, open a pull request very early, and discuss their code with other developers. And then very quickly, they deliver. So very, very quickly, this feature branch, this small piece of functionality they worked on, they will deliver it — to test environments, to pre-production environments, but also to production. Because at our place, we decided that a feature or a piece of code is only valid if it has already gone into production. To put the 'production ready' stamp on a piece of code, it must have gone into production; only production can tell us whether a piece of code is compatible with production or not. I'll come back to that. But the idea — and we'll come back to this later — is that saying a task is finished when it's not yet in production is lying to yourself. How can you know it's finished when you don't know if it works in production? Potentially, you'll have to redo it, rework it because it can't be deployed, for example. So it's not finished. So all our developers are led to push their code to production. Now, what we push to production is in practice quite different from what you're used to pushing. Generally, you only push features that are finished. We have the particularity of pushing things that are not finished, that are still being developed. From the moment I can execute it in production, one way or another, I will want to push it, and I will want to push it as quickly as possible. So as soon as I have a piece of code that I can activate one way or another — and we'll see later what I mean by 'one way or another' — as soon as I have a piece of code that can work in production, I will push it quickly.
This is what results in our 80 to 90 deployments to production per day. So how do we do this, with a team completely distributed around the planet, as quickly as possible? The idea is also to get feedback as quickly as possible, knowing that, once again, we have thousands of servers.
To do this, we created a bot.
This goes back 5 or 6 years now: a bot called Hubot. It's what we call a ChatOps bot — a term we created. ChatOps means doing ops through a chat client. Hubot connects to our chat. We used to use Campfire; now we use Slack, but you can use whatever you want — it's an open-source project, so you can use it tomorrow; it connects to many different chat tools. So Hubot connects to this chat, we send it commands, and it's Hubot that does all our work. In fact, there was even an article in Wired about Hubot, saying it was the GitHub employee who did the most work. It's really the cornerstone of everything we do. By the way, we made cartoons; if you go to our YouTube channel, you'll see that we made animated films about Hubot. Incidentally, we are one of the only IT companies to have three people in an animation studio and two people in a video studio; we create our own cartoons and our own videos ourselves. So we do a lot of slightly crazy things — we made giant balloon sculptures of the bot at our events. It matters quite a lot to us. And it allows us to do a lot of things. So here is our Slack interface. I can ask it to get me the map of a soccer field, for example. I can ask it to list all the food trucks near the office. I can ask it to find images, animated gifs, and so on. So it's used for a lot of useless things, but also for useful ones. For example, the bot is constantly listening to what's happening in our repositories, since that's where everything happens, and it goes into the Slack rooms linked to those repositories. In the sales team's Slack room, for example, it will watch everything happening in the sales repos, and as soon as something happens, it will post in the room: 'Hey, so-and-so committed this, so-and-so opened that.' And conversely, it watches what's happening in Slack and detects if we're discussing an issue or a pull request. If it detects that, it goes into the issue or the pull request and creates a comment with a link, so that people looking at the issue will say, 'Hey, someone talked about this issue in a Slack discussion; I can click on it, go see that archived Slack discussion, and follow the thread of the conversations that took place in one tool or the other.' So everything is connected. It really helps us daily, and all the systems we use daily are interfaced via Hubot. Hubot is the one that has all the keys to our entire system. We don't need logins for most of the systems, since most of the time there's a Hubot script that knows how to connect to that system and bring us the information. Salesforce, for example: I don't need to log into Salesforce to check the status of my clients or things like that; I can ask Hubot to retrieve all the information we have in Salesforce about a given client and display it directly.

But the first use case for Hubot was really everything related to production deployments. So when an engineer wants to deploy something to production, what do they do? They go into Slack — there's a specific chat room for that — and they can type a command, WCID, a shortcut for 'where can I deploy?'. And Hubot will give them all our deployment environments: the qualification environments, the pre-production environments, and so on. They are all listed there. Some are virtual, some are physical.
The physical ones, obviously, can only be used by one person at a time, so there's a lock concept, and production obviously has a lock on it. So when they want to deploy to production, for example, they will say in Slack, 'I want to get in the deployment queue.' Hubot manages a queue, and only one developer at a time can deploy their code to production. We call this 'testing in prod.'
They get in the queue, and Hubot tells them, for example: one person is already working in production, another is in the queue, so you will be after them. If the queue seems manageable in terms of waiting time, the developer types another command — 'queue me to deploy' — to get in the queue and be notified when it's their turn. And then a few minutes later, the developer can say, 'Okay, I want to deploy my feature': .deploy, my feature, to the production environment.
And that's it. That's all you have to do to deploy code to production in a stylish way. So you see, it's very easy to do this 80 times a day, it's very easy to distribute it all over the planet, since there's no way a developer can make a mistake in this. I have no way to endanger production, unless obviously I write completely crappy code.
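For the curious, a chat command like .deploy typically boils down to something like the GitHub Deployments API behind the scenes. This is only a sketch with Octokit, not GitHub's actual deployment tooling; the repository, branch, and environment names are made up, and a separate agent does the real shipping and reports back.

```ruby
require "octokit"

client = Octokit::Client.new(access_token: ENV["GITHUB_TOKEN"])
repo   = "acme/webapp"   # hypothetical repository

# Record the deployment of a feature branch to production. Whatever actually
# ships the code listens for this event and then reports a status.
deployment = client.create_deployment(repo, "improve-search-ranking",
  environment: "production",
  description: "Deploying my feature branch before merging to master")

# Later, the deploy agent reports how it went.
client.create_deployment_status(deployment.url, "success")
```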
But it's very, very difficult to mess up. I can't make a bad operation, a bad manipulation on a production server, since I never touch the production server. It's completely automated and completely abstracted. And the idea here is also that a new employee at GitHub can deploy to production after 10 minutes. They can't break anything. It's very easy. It's impressive at first. I've grayed it out here, but roughly, you see, the number of servers impacted by a deployment like this is about 900. Today, we deploy to about 900 servers, roughly 80 times a day, with this system.
So it's impressive, but in the end, once you've done it, you think, 'Oh, that's all it is,' and you can do it as many times as you want. Hubot does quite a few checks: it ensures that it compiles, that continuous integration has passed, that I'm not behind, that I'm not removing features. There are plenty of checks done to secure this. But in the end, it's quite simple. And in addition, we tell the developer: now that you've deployed something to production, for the next few minutes, you are in charge of monitoring what you've just deployed. Because the idea is also that the developer who just pushed something is best placed to detect whether there's a problem in their code, using the production information. So we give them access to an application called Haystack, and it's up to them to find the needles in the haystack. We show them all the needles: all the exceptions in production are what we call needles. It's up to them to analyze them. And what you see at the top — again, with this resolution we can't see it well — is a timeline; here we see roughly two hours elapsed from left to right. All the photos correspond to pull requests that have just been deployed, so all the photos are the developers who deployed their work to production. And if I see errors during these two hours, I can click on someone's photo: it isolates the time period between two photos, and it also isolates the pull request involved in that deployment. So from an exception, I can get very quickly to the pull request that contains only the piece of code that was just deployed. I don't have two months of deployments to dig through; I have two minutes, two days, two weeks — whatever the lifespan of the pull request was — and I can make a very quick link between the code that was deployed and the exception that was generated.
My investigation time is very, very short if I ever have a problem. And since deployments are very fast, I can choose either to roll back — rolling back means redeploying another branch — or to change the code directly and fix the error. Again, I have a short resolution time because I have a very small amount of code to analyze. So this gives me a lot of agility in correcting any problem. And we'll also see that the duration of a problem in production is supposedly very, very short, because my monitoring is very focused on what I just developed, and on a small piece, so the impacts are very minor. We'll come back to that.

Hubot does a few other things for me as well. Hubot is the one that receives all the monitoring messages and displays them directly in Slack for us. And then, when an incident is triggered, a production team always does roughly the same things: for this type of incident, I'll check this indicator, this machine, this service, and so on. It's a fairly repetitive process. Our philosophy is that from the moment something is repetitive, we don't do it ourselves: we code it, we ask Hubot to do it. So when a given incident occurs, it's Hubot that opens the ticket, and it's Hubot that goes to fetch the information we usually look at — the status of the services, the production graphs, the monitoring graphs — and puts it all directly in the issue. So the production engineer doesn't need to do this first level of analysis: they have everything in front of them, can work directly from there and go further. And what's super interesting is that all our systems, both for monitoring and for log analysis — we use Plug, for example — are interfaced via Hubot. So when my engineers are in a crisis situation, they go into a Slack room and show the graphs they've retrieved via Hubot. They say: look, the CPU graph. And everyone in the Slack room sees this graph. So we're not saying, 'you should go look at this graph in this dashboard'; we say, 'look at this graph, what do you think?' And the discussion, in a completely decentralized way, happens in this Slack room, via commands sent to Hubot; all the information comes up there, and the analysis is done in a concerted, common way. And here again, it's a bit like the pull request: you have engineers who will just watch, who will learn the mechanisms and reflexes of others who are more experienced, and it's almost like a movie
or a real-time report that we follow, because we have engineers who are really strong, and you learn a lot just by observing what happens in the Slack room: you have all the exchanges, you see everything they see, you don't miss anything, and you can really learn a lot by working this way. So, we've deployed our branch to production, and if everything went well, we can merge into master. We've deployed our feature branch to production, we've seen what happened, and if we're happy, if there hasn't been an incident, we merge into master. If I had a problem, we redeploy master: master was the previous version of our site, and if we merge, master becomes the new officially validated version of our site in production. Very simple. So deployment is really very simple: it's either my feature branch or master, and with that, either I go forward or I go back, but I never have a more complicated case than that. So generally, when I explain this, people think, 'Wow, that's great, it's awesome.' And then very quickly, they ask a lot of questions about the details, because there I went a bit fast, I painted you a picture that was all rosy, but... We often have...
Mine isn't moving on my screen. And we often have questions, and especially, when I talk to you about this, you've seen that I didn't talk to you about sprints. We don't have the concept of sprints. We don't really have the concept of Kanban. We don't really have the concept of Scrum. We don't know how to do Scrum. We're scattered all over the planet, so it's impossible to do.
A QA team, an acceptance team, and so on: we don't have a person in charge of acceptance, and we have a QA team that works very differently from traditional QA teams. And then, we don't really have the concept of a release, since we deploy continuously. Well, we still have the concept of a release, because there's also a version of GitHub called GitHub Enterprise that you can host on your own servers — not in SaaS mode but on-premise — but that's a bit different, it's separate. Our main way of working is quite simple.
And then, we don't use this thing. I don't know if your teams have already worked with Git. Generally, people go on the internet, they type Git, they type Workflow. They end up with this thing and think, 'Hey, this is the thing to use.'
And I had to try 12 times to understand how this thing works. So I told myself: from the moment I can't understand it, or I don't understand it immediately, there's something wrong with it. So, I don't know if you use this. Does this ring a bell, or not? Yeah? You use it? No? Yes? No? You stopped? Many people try it and stop. Some keep persisting, saying, 'No, it should work, normally.' But no, it doesn't work. It's too complicated, and above all, there are too many manual steps.
And there's a comic strip that sums this up. I'll keep it short for you. The guy explains all the processes, all the steps he has to go through to get code into production. And in the end, it's so complicated, it's so unrepeatable, there are so many manual interventions and...
so many skills needed to do this correctly, that his girlfriend sums it up by saying, 'So you also test in production.' Because in the end, if you don't have a process that is repeatable, industrialized, automated, and that executes the same way every time, then each deployment is necessarily a kind of test, since it's a new experience every time. Every production deployment is a new experience. If you do manual interventions for production deployments, it's necessarily a new experience — so in a way, it's a test in production. And if you're going to test in production anyway, you might as well do it directly; there's no need to carry around this level of complexity that ultimately doesn't bring much. This thing, if we come back to it, stems from the fundamental way most teams work.
All the people, all the companies I talk to today—it's my job to help clients who want to switch to GitHub, who want to switch to Git, who want to change their way of working. Everyone tells me they're doing agile today. There isn't a single person who tells me, 'No, no, we still do waterfall and old-school stuff.' So they all tell me, 'Yeah, we do agile, we do sprints.' We do two-week sprints.
I say, 'Yeah, cool. And do you go into production after two weeks?' Oh, no. We go into production after a month, after two months, after three months. Delaying that production deployment, doing lots of back-and-forths like this, means that, going back up the chain, we introduce a lot of complexity just to be able to do that production deployment. We complicate our lives because of it. I'll come back to this. The idea in the end — and I think everyone here has read *The Phoenix Project* — is that our way of working today means the time between an idea and production is super long. Between the moment we have an idea and the moment we put it in front of a customer, it's super long. And we have this false understanding of the agile mechanism where we tell ourselves we'll do a demo in front of a PO and that will be enough. But a PO is not a customer. It's not an end user. It's not my 2,000, my 10,000, my 100,000, my 25 million end users. It's one person, maybe two, and they're not necessarily the real customers. They are people who normally have a better understanding of the business, but they're not the end customer. They're not that unpredictable person — especially in e-commerce, in B2C — so they're not the real truth. It's not them who will give me feedback on what I'm doing, either from a business perspective or from a technical perspective.
So that's not what's interesting. What's interesting is the real feedback from production and real users. To sum up what happens today: everyone has their backlog with their to-dos, they have their work in progress, then they do QA, they do acceptance, and so on. And that takes time; that's why we don't go into production very quickly. Because we're waiting for feedback from the business, waiting for this demo, then potentially waiting for other people to validate the thing, and so on. And then we have these Git ninjas who arrive, cherry-pick, craft releases, do super complicated stuff, to ultimately deploy. And once again, as I was telling you earlier, we said it was done, but we deploy it afterward. We're not sure it's done, since we haven't deployed it yet. We're not sure it's done, since we don't have user feedback: we don't know how it's used, we don't know what information it brings back, we don't know any of that. So, what we do, if I put into perspective the way we work at GitHub:
we have this notion of to-do, etc.
work in progress, but we deploy immediately. We deploy immediately, and so we don't have Git ninjas; we have developers who go drink a beer, because they deploy to production, check that it works, and then go drink a beer — obviously I'm simplifying, but in fact, from the moment they've deployed to production, all the QA, acceptance, and so on can be done later. We can take the time to do it. The code is in production; that doesn't mean the end customer sees it. That doesn't mean you see it. From a user's point of view, the feature isn't necessarily finished yet. But I have my piece of code, which is potentially exercised by all users and on which I'm logging events; I'm learning about how it behaves, I'm seeing whether there are errors when it's activated, but it doesn't impact my users. So it's a somewhat secure way, so to speak, of putting code into production, seeing if it really works, and iterating on things that are already in production. And my colleagues, who will deploy code to production afterward, will incorporate my experiment. So we won't need to do gymnastics with Git, merges, and so on later, since everyone builds on the experiments of others and everyone incorporates the code of others. We don't need to backtrack or do complicated integration, since everything that was done has been delivered both to production and into the shared code.
QA, acceptance — I was talking about it earlier: we do have QA. We have people who do QA at GitHub, but they don't do it on features that are about to be put into production; they do it on what's already in production. They do QA on the system that already exists, and they're looking for optimizations more than they're qualifying features. And again, we only deliver small pieces, just small pieces each time, so we're incremental. Completely breaking a system when you're incremental and doing small deliveries is quite complicated. I've done it anyway — I managed to break the system, I could tell you about it — but it's very rare, it's very complicated. And besides, my crash was a crash from my point of view; from a user's point of view, it was a glitch. Nevertheless, it's true that the first time, when you're told you're going to deploy to production on github.com, you feel like you're pressing a remote control that's going to blow everything up, and what we want is to avoid getting burned and blowing up with the system. And as I was telling you, I managed it once — you see, it's what we saw earlier, and here's my photo, you can't see it well — I managed to trigger about 3,500 exceptions per second on github.com in production. Actually, these were JavaScript errors in your browsers.
But as you can see here, it lasted 25 minutes. Since it's JavaScript, there was a fairly significant latency time while people refreshed their browser, reloaded the new JavaScript library, etc.
We're not at a 100% success rate. We still have failures; there are always errors that occur in production. But with traditional systems, the old deployment systems, there has never been a 100% success rate either. There's no one here, with all these QA teams, all these testing teams, all these workflows, all these approval processes, who has a 100% success rate. Nevertheless, with your complicated systems, the time it will take you to identify, correct, and redeploy an application in production will be much longer than what I have here: 25 minutes to return to optimal functioning. That's nothing at all — detection time plus resolution time — and on top of that, we didn't roll back. We had time to correct the feature, so we kept the feature and returned to normal operation. We didn't lose any functionality; there was no rollback, there was a redeployment of the fix for that feature. You see, with this system, in the end, we're not taking more risks than what you have today — I would even say we're taking fewer. Again, we can still run into problems, but we encounter them much less often, and the business impacts are much lower. I don't have outages that last for days and days.

So how do we secure this entire system? We have a few frameworks. All these frameworks are open source, so you can use them. They were admittedly developed for our technical stack, but there are always equivalents in other technical stacks. For your information, we use Rails, Ruby on Rails, but equivalents have been recreated for JavaScript, Java, .NET, and so on. The first thing we use — and if you only do one thing, this, for me, is the most interesting one — is a framework called Scientist. The idea of Scientist is to learn more about the behavior of a new piece of code in production, while being completely safe. The idea is to say: okay, I have a version V1 of my code today, and I want to make a version V2, because there are advantages, because I found a new idea, a new way of doing things — but I want to do this while being sure of my code. With this, a user request will be dispatched to both implementations in production, in parallel, and over the long term I can compare the results of the two versions.
This means I will let it run in production for a week, and after a week, I can say exactly whether my new way of working functions well, functions better, produces the same results, or at least produces consistent results — because I don't necessarily expect better results, but at least consistent ones. And if it crashes, it's not a problem, because Scientist protects me from that crash. This is what allows us to push pieces of unfinished code, half-developed features, into production like this: because we execute them through Scientist. Scientist becomes our entry point for that small piece of code, and behind it, we learn a lot. On github.com right now, we have about 200 or 300 Scientist experiments running in parallel.
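Since Scientist is an open-source Ruby gem, here is a minimal sketch of what such an experiment looks like; the permission-checking methods are invented placeholders, and a real experiment would also define how its results get published.

```ruby
require "scientist"

class PermissionChecker
  include Scientist

  def allowed?(user)
    science "new-permission-check" do |experiment|
      experiment.use { old_permissions(user) }  # control: today's V1 code path
      experiment.try { new_permissions(user) }  # candidate: the new V2 code path
    end
    # The block always returns the control's result, so users only ever see V1.
    # Both results and timings are observed for comparison, and an exception
    # raised by the candidate is recorded instead of reaching the user.
  end

  private

  def old_permissions(user)
    # existing, trusted implementation (placeholder)
    user.respond_to?(:admin?) && user.admin?
  end

  def new_permissions(user)
    # new implementation being evaluated in production (placeholder)
    user.respond_to?(:roles) && Array(user.roles).include?(:admin)
  end
end
```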
Everything goes through an experiment at some point. My development will start with that. And what I will deliver, strictly speaking, is an experiment. This experiment may potentially be bigger, larger, more ambitious. But I will go through experiments and see how it works. And I learn. I learn very quickly. That's also our entire agility system—having the fastest and smartest feedback loop possible. So we have these tools that allow us to do that. Next, we use feature toggles or feature flippers, as you prefer. So we have a framework for that called Flipper, but there are many other frameworks available. And so, our release cycle for a completed feature goes through feature flipping.
We will create something, and at a given moment, it will only be available to a certain number of people. Two or three developers, the dev team for that part of the site. After that, it will be all GitHub employees who can access it and provide feedback. And then, we can decide to open it to the rest of the world, if you want. But we can also decide to open it only for the Paris region, for example. We have the ability to gradually open the tap and decide who sees that feature. So, in this philosophy of learning progressively and not taking risks, but having things very quickly in production, it also allows us to have a lot of agility. And this is what also allows us to say that we did the QA, we did the acceptance, but we did it in production, and we can release a feature to the public without needing another deployment event. Making a feature available to end users is not correlated with a deployment event. I can deploy on a Tuesday and release the feature to my users the following Monday. In the meantime, I was able to verify that it worked well in production and get feedback from all my employees at GitHub. At GitHub, it's great because we are both the client and the provider, since we make dev tools for devs. So we are both the business and the dev team.
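Flipper is also an open-source Ruby gem, so here is a hedged sketch of that progressive 'opening of the tap'; the feature name, group, and actors are invented, and a real setup would use a persistent adapter (database, Redis, etc.) rather than the in-memory one.

```ruby
require "flipper"
require "flipper/adapters/memory"

# A tiny stand-in actor; a real app would pass its User model
# (anything that responds to flipper_id works).
Actor = Struct.new(:flipper_id, :staff) do
  def staff?
    staff
  end
end

flipper = Flipper.new(Flipper::Adapters::Memory.new)

# Define a group: "staff" stands in for GitHub employees in this example.
Flipper.register(:staff) { |actor| actor.respond_to?(:staff?) && actor.staff? }

feature   = flipper[:new_code_view]     # hypothetical feature name
developer = Actor.new("dev-1", true)
visitor   = Actor.new("user-42", false)

feature.enable_actor(developer)         # step 1: two or three developers
feature.enable_group(:staff)            # step 2: every employee
feature.enable_percentage_of_actors(5)  # step 3: a slice of end users
# feature.enable                        # finally: open it to everyone

puts feature.enabled?(developer)        # => true
puts feature.enabled?(visitor)          # depends on the percentage rollout
```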
We have that luxury, let's say, of being our own users. So this allows us to work on our code in a pretty nice way. And then, there's always the data question. People ask: if I change my database schema, how does that work? For that, we also created a tool. This one works with MySQL, so if you're not a MySQL user, it won't be of much use to you. But with this tool, we are able to perform schema migrations on the fly, in production.
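The talk doesn't name the migration tool, so this is not it; but as an illustration of the same idea in the Ruby/MySQL world, here is roughly what an online schema change looks like with the open-source LHM (Large Hadron Migrator) gem. The table and column names are invented, and an ActiveRecord connection to MySQL is assumed.

```ruby
require "lhm"

# Assumes ActiveRecord is already connected to the MySQL database.
# LHM copies `users` into a shadow table, keeps it in sync with triggers while
# backfilling rows in small throttled chunks, then atomically swaps the tables,
# so the migration runs while production keeps serving traffic.
Lhm.change_table :users do |m|
  m.add_column :preferred_language, "VARCHAR(10) DEFAULT 'en'"
  m.add_index  [:preferred_language, :created_at]
end
```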
So we can say: here are my tables today, here are my tables tomorrow, and migrate all the data at a rate of X transactions per second — because we can adjust this to avoid bringing production to its knees — and we can completely control the pace and decide when to make the switch. We can also do this on temporary tables, and so we can take it a step further: I can have an experiment where V1 hits the existing production schema, and V2, the one I'm experimenting with, works on a new schema in a temporary table, and then I can compare at the database level, if you will, to check that my new system behaves correctly. Then, if you put Flipper, the feature toggle, on top, I can also say that at a given moment I will switch to both the new version of the code and the new version of the schema that goes with it, and I have an immediate switch. So I have synchronized all of that. All this to say that what we do, if you think about it, is a bit like something the automotive industry did a few years ago — we've talked a lot about Kanban, Toyota, and so on. Before, what they did was design the car and then think about how they would produce it, and so, between the decision to launch a new model and its actual launch, you had cycles of 7 years. Until one day they said: while designing the car, we will also design the manufacturing process — we will do both at the same time, and we will iterate on that. And they achieved production cycles of 3 years.
And that's exactly what we do. Until now, very often in IT, you have the people who develop and then the people who put things into production, and these are two separate processes. The person who creates a feature doesn't think about how it will go into production; they make their feature, and then we put it into production. What we do is really mix the two. The developer who develops something breaks the work down according to the production deployment steps for that particular feature. They think about their production deployment cycle. They consider what information they will gather at a given moment in production to know how they will develop the next iteration — the next iteration is based on feedback from production, and I have that information, and I thought about it when I developed my feature. So I'm not only focused on the business perspective; I also understand my production deployment process and what information I can gather from production. This allows us to have developers who are much more precise in their work, much more autonomous, who take fewer risks and are better informed about what's happening. And so this allows us to have a lot of love for our robot and to have very happy developers. There — I've almost caught up on my lost time. If you have any questions, I'm available. Thank you.