Matthias Patzak
Duration: 50 min
Published: March 15, 2024

Transcript

[00:00:05] So, the last talk of the evening. I'm Matthias, Matthias Patzak. I'm a German, obviously, and I apologize for my horrible German accent and any pronunciation and grammar issues. I'm an enterprise strategist with AWS. No one really knows what an enterprise strategist is; I would translate it as: we are living customer references. I'm part of a team of 13 former customer executives who share their own experience, and what we see in the community, on how to innovate at scale, introduce lean-agile ways of working, or do cloud transformations. I was CTO of AutoScout24, a European marketplace for used cars, and managing director of Home Shopping Europe, an e-commerce company in...
[00:00:51] in Germany. And I introduced what we would now call a data mesh; the term was not coined at that point in time, so back in 2017 we called it a distributed data platform. And I'm going to share my own experience and what I see in the community: how customers of AWS are implementing data meshes. Who has heard the claim that data is the new oil? Some. It was coined in 2008 by a British scientist. And what was the consequence of this person saying data is the new oil?
[00:01:36] The consequence was that every larger organization started to store every piece of data they could find in their applications. At a recent conference about data, I heard from a consultant from ThoughtWorks, a company I love to work with, that data is the new milk: if you don't use it quickly, it goes bad. Actually, I believe that's not really true. From my perspective, data is the new wine. There are some types of wine that are really good if you drink them quickly, but if you store them for a longer period of time, you can no longer consume them. While other types of wine, if you produce and store them correctly, get more and more tasty and more and more valuable over time. And this is what we see in the industry. We see that 90% of today's data has been created in the last two years alone. But nevertheless, just 24% of companies characterize themselves as being data-driven. So, they store a lot of oil, but they are not really able to leverage the value that is claimed to be in this data.
[00:02:51] And there is a survey by NewVantage Partners; they have now been acquired by another company, but if you're in the data space, I can truly recommend their survey on big data. They survey the Fortune 5,000 companies in the US and get a large number of responses. And for years, the biggest challenge to the business adoption of data, so not building a data lake, but really making value out of data, has been culture. Actually, this number is decreasing, so that's a good sign. Two years ago, 90% of the companies who answered this survey said the biggest challenge to the business adoption of data is culture. But now, with generative AI, people are getting more nervous about the technology.
[00:03:50] And why is this? Why are culture, people, communication and process the biggest challenge to being data-driven? Let's have a look at the history of data. I started my career in 1995 at an automotive supplier. I can remember very well the times when we had these AS/400 servers. Every morning, these servers printed out large lists on endless green and white paper, with the materials we as an automotive supplier were using. This is what we at AWS call the data-aware era, with just transactional databases.
[00:04:33] And if you look at the collaboration patterns, there were just two parties involved. We had the producers of data, the teams that had to share data. They really didn't want to share data, they just wanted to keep the lights on on their AS/400 machines; those were mainly the database administrators. And we had the consumers, the teams that wanted to use data. These were mainly heads of departments: the head of the finance department, the head of the logistics department, the head of the procurement department.
[00:05:10] But after a while, the batch jobs creating these endless lists of green and white paper took longer and longer and longer. And then, as an industry, we came up with a new technology.
[00:05:23] And this is what we call the second generation, the era of data warehouses. We came up with a new, very specialized, specific piece of technology. And we created a new team with new specialist roles to handle this new technology. Nowadays, we would say we applied the inverse Conway maneuver: we created an architecture, and following this architecture, we had an organizational pattern. We set up these specialized teams to make life easier for the producers and the consumers.
[00:06:04] You might say, okay, we set up a loosely coupled organization. What we actually implemented is a proxy between the producers and the consumers. And these specialized teams are blocking the communication.
[00:06:20] And then we invented data lakes. A great new technology. With data warehouses, we were just able to handle structured data; now we are able to handle unstructured data and blobs.
[00:06:33] But the organizational anti-pattern, the proxy organization between the consumers and the producers, is still there.
[00:06:42] And let's have a look at a simplified data lake organization; you will see it's not so simplified. This is a domain I'm familiar with, since I come out of e-commerce. So, you have an e-commerce domain with six teams: a team responsible for the homepage, the list page, the product detail page, the checkout page, and so on. And you have a fulfillment domain. These are teams that are probably running the different modules of an SAP system for procuring,
[00:07:15] buying, storing and handling material, producing stuff, selling stuff, handling returns.
[00:07:25] Then you have a marketing department. It's just a department: 27 people, 150 software-as-a-service tools, all sharing a single account. You have the leadership team, for sure. And you have a controlling department.
[00:07:41] And you have a finance department.
[00:07:45] And as I said, you have a data lake. And what do the communication patterns in this organization and the logical analytical flow look like? No one is talking to each other.
[00:07:59] Everyone is talking to the data lake, and very often to the data lake specialists. And these are really, really capable, really good people, but even they do not talk to the producers or the consumers of the data. So it's really loosely coupled, but it's a broken communication pattern.
[00:08:18] Imagine someone from marketing needs something from the checkout team in the e-commerce domain. To whom do they go? If they go to someone on the checkout team and ask for specific data they need for a marketing campaign, what will the answer be? Yes, for sure, we create this data. The marketing person then asks: where is it stored in the data lake? And the answer will be: I don't know.
[00:08:42] With whom should we talk? We don't know, ask Pablo. But then you will realize that this central organization, set up as a bottleneck, is overwhelmed by priorities. For the marketing organization, it will usually take weeks or months to get priority for their use case.
[00:09:01] What is another name for data lake?
[00:09:07] Many people call them data swamps. And it's a bit unfair, but I sometimes compare data lakes with a flea market. Treasures are as hard to find in a data lake as they are in a flea market: if you go there, it's a large hall with a lot of, sometimes, treasures, sometimes just crappy old pieces, but you really need to know where to find the treasures, in the flea market as in the data lake.
[00:09:37] And those three parties, these poor three parties, they have issues. The producers are the owners of transactional applications: monoliths, microservices, commercial off-the-shelf or software-as-a-service products. They have no incentive to share data. Their incentive is to deliver features, in most organizations, or to deliver business value, in some organizations. They are very disconnected from the consumers of their data.
[00:10:11] The quality of the data, especially for analytical purposes, is very low. And they have high transactional but very low analytical skills. In the middle, we have the data lake team.
[00:10:24] And they are overwhelmed by the data sources. Usually, data lake teams are understaffed. And they are also overwhelmed by the demand from the consumers. Very often, data lakes are very complex setups with a lot of ETL pipelines.
[00:10:43] They have constantly shifting priorities, because the consumers want some data, the finance department wants some data, and the leadership wants some data, and usually, who wins? The leadership. They are disconnected from both parties and they are often not trusted. And what makes it worse: they are in highly specialized, activity-oriented roles, so very often they are not generalists and not set up cross-functionally.
[00:11:08] And then we have the consumers, and they are very often frustrated by having no priority and no transparency, and they are often not getting a lot of value out of the data and the data products.
[00:11:20] Very often they don't have a lot of analytical skills. Many people just know how to sum up a column and change the color of a cell in Excel, but they do not really know about cohorts, proper visualization, medians, or other basic statistical concepts. And they're disconnected from the producers.
[00:11:46] There are three classical assumptions that led to the current technical and organizational setup in data lake organizations. One is: data must be centralized. And this comes from: yes, we bought our first data warehouse, and since then we have centralized our data.
[00:12:04] Another is: technology, organization and architecture are monolithic. This is likewise a consequence of the first data warehouses and the enterprise data models that were created in the 90s and the early 2000s. And the third: technology drives architecture and organization. And this is wrong. My core beliefs are:
[00:12:27] data, a data lake and a data mesh need to create value for the consumers. Therefore we need quality at the source, and the source is not the data lake; the source is the transactional systems. And data originates distributed.
[00:12:47] And so, it must be handled decentralized. Technology, architecture and organization are distributed as well. And the organization, the communication and the people drive the architecture and the technology, not the other way around.
[00:13:05] And this is why, in 2015, '16, '17, the first, mostly digital-native, companies applied their learnings from their microservice implementations, that is, distributed transactional systems, to the world of data.
[00:13:22] And they took the middleman out between the producers and the consumers. Here it looks like it's just one producer and one consumer. Actually, if you are responsible for any transactional service in a microservice organization, or for an application, congratulations: in a data mesh organization, you're now responsible for analytical data. So, you have a lot of producers.
[00:13:48] They took the middleman out and put the responsibility for analytical data where it belongs, and it belongs in the hands of the producers, the teams that have to share data.
[00:14:00] And they put the responsibility in the hands of the teams that use data. And based on these inventors and the early adopters, the book Data Mesh was written. There is still a third party, and this is the platform. The platform consists of teams that provide tools and infrastructure for the data producers and the data consumers to build data products.
[00:14:30] And let's have a look at a simplified data mesh organization.
[00:14:34] And the communication paths and the paths of the analytical data flow changed. As you see, there is no data flow and no communication flow to the data mesh platform. And this is because no data is actually stored in a centralized data mesh platform. There is a variant where, from a technical perspective, you have a centralized data lake, but the responsibility for the different data stored in the data lake lies with the consumers and the producers. In the true nature of a data mesh, however, the data is stored distributed, at the place where it originates.
[00:15:14] So, let's have a closer look at the data producers.
[00:15:19] And as I said, when you own a transactional application or a transactional microservice, congratulations, chances are high that you are now a data producer too. I've chosen the example of the checkout team here. The checkout team has several microservices for handling the complex business logic of the checkout. And yes, they have OLTP databases, some of them SQL databases, some of them NoSQL databases, whatever the team decides is the best data storage for their microservice. But in the world of a data mesh, they also have to maintain a data pipeline. And they need to provide data products: analytical data products that provide extracted, transformed and loaded data out of the checkout service for whoever needs this data inside the organization. And this might be just a standard OLAP interface with an API, or it might be very sophisticated AI/ML models and use cases.
[00:16:27] So, it might be just a sales report: how many items were sold today? And that's just a data set.
[00:16:37] Or it might be a product recommendation engine: based on similar products, when someone bought a certain item, what were similar products that others bought? And this is then an endpoint that can be called by any other application.
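As a toy illustration of such a recommendation data product, a co-purchase lookup over checkout orders could look like the following sketch. All data and names here, such as `recommend`, are invented for illustration and are not from the talk:

```python
from collections import Counter

# Hypothetical checkout order data: each order is the set of items bought
# together. In a real data product, this would come from the checkout
# team's analytical pipeline, not a literal in code.
orders = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"phone", "charger"},
]

def recommend(item, orders, top_n=2):
    """Return the items most often bought together with `item`."""
    together = Counter()
    for order in orders:
        if item in order:
            together.update(order - {item})
    return [other for other, _ in together.most_common(top_n)]

print(recommend("laptop", orders))  # ['mouse', 'keyboard']
```

In a data mesh, this logic would sit behind the producer's endpoint, so any other team can call it without knowing how checkout stores its transactional data.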
[00:16:53] And the data product, the term that I have used a lot: it's not just an API, it's not just an endpoint. As I said, the early adopters of data mesh infrastructure applied a lot of thinking and mental models from the microservice world. And in a microservice world, too, a microservice in a self-contained system is more than just data or code. A data product in a data mesh world consists of the actual data, so you need to take care of your data, and, for sure, the metadata that describes what the data actually is.
[00:17:33] It's also the code that you need to extract, transform and load the data. The infrastructure, and definitely infrastructure as code as well. And the configuration of your infrastructure and your data pipeline.
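One way to make the "a data product is more than an API" idea concrete is a small descriptor that bundles all the parts named above. This is a hypothetical sketch; the field names are my assumptions, not an AWS or data mesh standard:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative bundle of the parts a data product consists of."""
    name: str                     # e.g. "checkout-daily-sales"
    owner_team: str               # the producing team that owns it
    data_location: str            # where the actual data lives
    metadata: dict = field(default_factory=dict)  # schema, freshness, semantics
    pipeline_code: str = ""       # reference to the ETL code repository
    infrastructure: str = ""      # reference to infrastructure-as-code templates
    config: dict = field(default_factory=dict)    # pipeline configuration

# Hypothetical instance for the checkout team's sales report product.
product = DataProduct(
    name="checkout-daily-sales",
    owner_team="checkout",
    data_location="s3://checkout-analytics/daily-sales/",
    metadata={"schema_version": 1, "refresh": "daily"},
    pipeline_code="git@example.com:checkout/sales-etl.git",
    config={"schedule": "0 2 * * *"},
)
print(product.owner_team)  # checkout
```

The point of the bundle is ownership: whoever owns the product owns its data, metadata, code, infrastructure and configuration together, just as a microservice team owns its whole self-contained system.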
[00:17:50] What we see is, and we can debate this in the Q&A session: the checkout team.
[00:17:59] They're not really able to build data products on their own. Yes, they have a product owner, and they have T-shaped or V-shaped software developers who can, even in a you-build-it-you-run-it setup, build a mobile app, a web page, front end, back end, databases and infrastructure in the cloud. And now we ask them to build data products as well: data sets, AI/ML product recommendations. This is tough. And this is why we propose that you enhance your traditional product teams with two new roles. One is a data product owner, and this is a very business-minded person. This person understands the context of the team or department and the actual business processes that happen; the data that results from these business processes; and how the transactional data of these business processes is stored in the transactional data store. But they also understand the needs of the customers, the consumers of the data. So, they build a roadmap and a vision for the data product of the checkout team.
[00:19:10] They need to be strong communicators, because data mesh emphasizes direct communication between the producing teams and the consuming teams. And we also see that the teams get an additional data engineer, at the beginning. Later on, what we propose, and what we see that successful organizations are able to do, is that their traditional T-shaped transactional engineers also build analytical skills and build parts of the data product.
[00:19:41] But at the beginning, inject this additional skill set into your teams. The data engineer works with the application developers to map transactional data to analytical data products. And this needs to be a generalist in the data sphere.
[00:20:02] So, what are the responsibilities of such a data producer team?
[00:20:08] Each team manages its data and publishes its data product; so the checkout team publishes its data. What I see with some customers is that they have this e-commerce domain, and then they create an additional team inside the e-commerce domain, and this is the data team. And this is okay, but then you still have a proxy organization, now inside your e-commerce domain, and they are not very familiar with the actual transactions happening in the teams. So, enhance and encourage your actual transactional teams to produce data products.
[00:20:44] Another important aspect is that the data producer teams decide on the technology they want to use.
[00:20:53] Think about the organization that I described with the e-commerce domain. It might be a microservice-driven architecture based on Java, cloud technology and NoSQL databases, while the fulfillment domain runs an SAP system. Why should these two teams, with totally different transactional architectures and technology, use the same data technology? It's totally okay for them to choose their own technology, as long as they all comply with the macro architecture and the technology stack of the overall organization. Each product provides interfaces to allow others to interact with it. They provide metadata. And the most important aspect: the data product must provide business value. Don't build a data product when you don't have a consumer for it.
[00:21:46] And the consumers. As an example, I've chosen the marketing team, the 27 people with the 150 tools. Right now they use a lot of data: social media traffic, user segmentation, and the checkout data products. In a data lake organization, a data engineer and a data scientist out of the data lake organization would, when they had time, help the marketing team come up with the analytics and AI/ML tools that are needed in the marketing domain. Here, the marketing organization, the finance organization, the controlling organization, and even the e-commerce organization, which is a producer, are probably also consumers of some data. But every consumer also gets new skills and new roles. There is the same role of the data product owner, with the same skills and the same responsibilities. And on the side of the data consumers, you probably need the skills of a data scientist more than those of a data engineer, because there you need someone who can work with the business team to build insights and insightful data products. And again, you need someone who is a generalist in the data sphere. So, the I in the T-shape is more a data scientist, but with the skills of a data engineer as well.
[00:23:17] And what are consumer data products in the marketing sphere? It could be a data set for churn prediction. It could be a data set on customer clustering that is used all over the organization: for targeted email marketing, in controlling, and also in reports to the board about customer segmentation. But it could also be an endpoint for dynamic pricing decisions in real time. Or it could just be a dashboard on market basket analysis.
[00:23:52] So, we have the checkout team, which is a data producer, and we have the marketing team, which is a consumer. And I said, very easily: we give them additional roles, we give them additional capacity, and then they will be able to build data products with their own technology. It's not that easy, because analytics is not a core competency of these teams.
[00:24:16] And this is where we introduce the data mesh platform.
[00:24:23] And the data mesh platform provides tooling and infrastructure, consulting and training, and sometimes also additional capacity and resources. But it's also the central hub for facilitating and moderating a federated governance and security approach. The mission of a data mesh platform, from my point of view, is to make the life of the producers and the consumers simple, efficient and stress-free. It's not reducing cost; it's not defining technology standards. The main purpose is to make the life of the producers and the consumers simple, efficient and stress-free. And stress-free, for sure, also includes security guidelines, and security as a service and security as code built into the data infrastructure and the tooling that the platform provides.
[00:25:12] So what are common tools that no one needs to build twice in an organization?
[00:25:16] Typical products that a data mesh platform builds are access control mechanisms. And you do need access control mechanisms: no one should have access to all data of your organization.
[00:25:28] They provide monitoring and billing, they provide CI/CD pipelines. They provide prepared environments for the data scientists and the data engineers. They provide a data catalog. But from my perspective, in many organizations, data catalogs are overrated. In many organizations there's a hope that it's like a telephone book: you just look up some piece of data, you read some metadata, and then you really know what the proper data source is. But it's not that easy. For me, a data catalog is more a hint in the right direction as to which team I should talk to. So don't over-invest in a data catalog. And they provide insights on cost management. Each team is, for sure, responsible for its data product and its cost, but you need to consolidate your overall bill in a single place.
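The "catalog as a hint" idea could be sketched like this. The entries and the contact channel are invented; the point is that a lookup answers "whom do I talk to", not "what exactly is in the data":

```python
# Hypothetical minimal catalog: each entry points to the owning team and
# a contact channel, rather than pretending to fully describe the data.
catalog = {
    "checkout-daily-sales": {
        "owner_team": "checkout",
        "contact": "#checkout-data",  # invented chat channel
        "description": "Items sold per day, refreshed nightly",
    },
}

def who_to_talk_to(product_name):
    """Return the contact hint for a data product, or a fallback."""
    entry = catalog.get(product_name)
    return entry["contact"] if entry else "unknown -- ask the platform team"

print(who_to_talk_to("checkout-daily-sales"))  # #checkout-data
```

Keeping the catalog this thin matches the advice above: it routes a conversation to the producing team instead of trying to replace that conversation.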
[00:26:18] And an essential part of the platform is also the education and consulting of the teams.
[00:26:27] I said that the teams can choose their own technology. But you need to be careful with this aspect.
[00:26:35] So, what usually happens when you give teams autonomy, for example the autonomy to choose their own technology stacks?
[00:26:44] You decouple your organization and the performance of the teams increases. And the performance of the organization increases as well. But only up to a certain point.
[00:26:56] There is a sweet spot in any operating model and in every organization where you need to find the balance between the needs of the overall organization and the needs of the single teams.
[00:27:08] And one of those sweet spots is: which technologies do we really want to have here?
[00:27:13] Imagine a microservice world where every team is allowed to choose its own programming language. It's very cool for the teams, but it's a nightmare for the organization,
[00:27:23] depending on your code ownership and code exchange model. You need to create alignment, and this is part of the platform. But the platform does not have a data architect who defines the macro architecture of the data mesh landscape and the technology stack. Instead, the people of the platform, the senior technologists in the platform, provide a communication platform and a fireplace where the senior technologists out of the data mesh producer and consumer teams can meet, and they co-create and define the strategy of the data mesh platform, the macro architecture and the technology stack.
[00:28:04] But the platform is also the central governance hub, and in the world of data, you need governance in place.
[00:28:15] In essence, the platform helps the data mesh organization to run smoothly.
[00:28:23] So, how do these parties collaborate, co-create and align? This is necessary because it's a data mesh: it's loosely coupled and highly aligned, but it's not fully autonomous.
[00:28:36] One aspect that we see organizations and companies in the community implement is a federated governance board, which increases acceptance. So we have this organization.
[00:28:52] And then you implement a federated governance council; you can call it however you want: a data mesh board, a data mesh community. But it's not just knowledge sharing. Each domain sends a senior technologist to this board, this council, and they co-create the strategy of the data mesh platform; they define the technology stack, the policies and the priorities. And this increases the probability that the data mesh teams, the producers and the consumers, are going to accept and adopt the tools the platform provides. Because this is the biggest issue with platforms: the not-invented-here syndrome, that the teams, and this is true in transactional systems and microservice platforms as well, are not accepting them. What really drives adoption is job rotation. In my organization at AutoScout24, what we did with our platform: each quarter we would rotate a third of the platform engineers out of the platform into the feature teams, to free up seats in the platform and so that the platform engineers have to eat their own dog food and use their own data services.
[00:30:04] And we would rotate volunteers into the platform teams.
[00:30:08] So that they build the next iteration, the next version, of a platform service. What you also create with this setup is a network between the people out of the feature teams and the people in the platform: informal networks. And this helps you to build a next version of a service that will be adopted.
[00:30:33] And for sure, you should implement a community of practice, where people can meet and share experience about the technology they use, about the use cases they implement, about the insights they generate.
[00:30:46] So, where to start with a data mesh organization?
[00:30:53] Most, if not every, organization I know is right now a centralized data organization.
[00:31:02] And this is the beauty of a data mesh compared to a data lake implementation. Data lake projects are usually big-bang setups, where you need large projects that last two or three years to implement a data lake. With a data mesh and its distributed nature, you can have a much smoother, more incremental and more iterative approach. In our example, you could start with the use case of the marketing team that wants data out of the checkout team. You could start by taking the data for this use case out of the central data lake and, depending on what the data product is, have it stored decentralized, in the same technology stack, either with the checkout team or as the first data product in the marketing domain. And while you do this,
[00:31:52] while you create this use case in a data mesh setup between these two parties, you would also create your first small platform team for a data mesh.
[00:32:03] So, you still would have your data lake, but you would have the first small data mesh platform team creating the first iteration and the first version of an infrastructure that supports the very first use case between the marketing team and the checkout team.
[00:32:23] What I also mentioned before is that you can have a setup where you have, from a technical perspective, a data lake. So you still store all your data in a central technical data store. But from a responsibilities perspective, like a service catalog in a microservice world, it's very clear which piece of data, which bucket, which metadata catalog is the responsibility of which team. So without any technical change, you could start your migration to a data mesh with just an organizational change: clear responsibility for the data in your data lake, held by the teams that are consuming or producing it.
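A minimal sketch of this organization-first migration, assuming an S3-style central lake: the storage stays in one place, but every bucket prefix has an explicit owning team. Bucket names and team names are made up for illustration:

```python
# Hypothetical ownership map: the lake stays technically central, but each
# prefix is explicitly owned by a producing or consuming team.
ownership = {
    "s3://company-lake/checkout/": "checkout",
    "s3://company-lake/fulfillment/": "fulfillment",
    "s3://company-lake/marketing/": "marketing",
}

def owner_of(path):
    """Return the team responsible for a piece of data in the lake."""
    for prefix, team in ownership.items():
        if path.startswith(prefix):
            return team
    return None  # unowned data is a gap to fix, not a feature

print(owner_of("s3://company-lake/checkout/orders/2024.parquet"))  # checkout
```

The useful property is that the map makes ownership gaps visible: any path that resolves to `None` is data nobody is accountable for, which is exactly what the proxy organization used to hide.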
[00:33:08] You would still have data being written to and read from a data lake, but from a communication perspective, you would have collaboration and communication between the producers and the consumers.
[00:33:23] And there are four steps that you need to take to decentralize your data lake. First, you need to ensure that your overall organization has a domain-driven structure. And this is not just limited to microservice organizations; you can have domain-driven collaboration patterns in all parts of your organization. Second, you have to establish the key roles and ensure that you have incentives in place that align the teams in a common direction. Third, you have to move ownership to the source systems and align data formats. And fourth, you need to have this federated governance council. Without a mechanism to create alignment, your data mesh initiative will fail.
[00:34:06] What I believe is that many organizations will end up in a hybrid data mesh, data lake and data warehouse setup, because there is no one-size-fits-all solution. You will have some domains with this direct data mesh style, and you will still have other parts of your organization that store pieces in a data lake. And this is totally fine. This is one of the beauties of the data mesh approach: you don't have to have a one-size-fits-all solution for data and analytical data products across your organization. You can apply different operating models, different working modes, and different technology setups.
[00:34:51] But whatever you do, when you apply a change, don't boil the ocean. Just start implementing a data mesh, start with your first use case, and face uncertainty, because there is uncertainty.
[00:35:05] But then observe, orient, decide and then act, in just the right direction. Try something out, measure your progress and then adapt.
[00:35:15] So, thank you very much. As I said, we finished a bit early, so, any questions?
[00:35:51] If we think about the different kinds of teams, team-topologies speaking, where does the federated governance council fall?
[00:36:07] Is this some kind of enabling team or not at all?
[00:36:14] Part of the responsibilities of the platform are enabling. And when you think about a large platform, you might have dedicated people just taking care of enabling and enablement, and then you could call a team of five, for example, out of the data platform an enabling team. I don't think I would implement a dedicated enabling team; at the beginning maybe, but later on I would not really have the data platform team as an enabling team.
[00:36:49] Yeah, thanks, thanks for the talk. I've been in two presentations today where we've been told that we should reduce the communication links between all the teams inside an organization, and I kind of feel like we are doing the opposite with what you've presented: going from a data lake, where the communication is between a team and a data team, to this data mesh structure with a lot of different communication.
[00:37:24] Yeah, so every communication is a dependency, and every dependency makes you slow. But when you have a dependency, and obviously you need to have some communication and dependency between a consumer and a producer, having a dependency between two parties is better than having a dependency between three parties. Or, in the case of many data lake organizations, you have a dependency between two parties, but one is hardly responding, or they are not really familiar with the actual use case and the real data. And this is why the communication there is slow and cumbersome. Thanks.
[00:38:08] Yes, thank you for the talk. Let's say I have a use case: I want to aggregate transactions for my customers from data that is managed in different services. How do I manage this use case with data mesh? Who will be responsible for doing the aggregation?
[00:38:32] You said the platform.
[00:38:34] Ah yeah, to transform.
[00:38:35] Yeah, it depends. Many customers I talk to have the reflex to put it in the platform and come up with just an aggregation product.
[00:38:47] In very rare cases there is a justification for a dedicated aggregation service, but in many implementations I've seen, these aggregation services got too complex.
[00:39:02] They have too much speculative design, and they cover too many use cases for whoever might need them in the future. This is why my general advice is: the consumer that needs to aggregate data from different sources does the aggregation itself and provides a data product that aggregates the sources.
As one data pipeline, and then in a second step it builds its use cases on top. What sometimes happens, when it's implemented well, is that this aggregation data product gets used by other teams as well. Then you need to decide who takes ownership of this aggregation data product, and that's where it starts to get tricky. If you have a very large organization, it's a candidate, like with complicated subsystems in Team Topologies, for a dedicated team, either in the domain of the consumer or of the producer, that just takes care of this aggregation data product. But don't do it too early.
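The consumer-side aggregation described above can be sketched in a few lines. This is a minimal, hypothetical example, not a real data mesh API: it assumes each upstream data product publishes its records as a list of dicts with `customer_id` and `amount` fields as part of its contract, and the consumer combines them into a new aggregated data product.

```python
# A minimal sketch of a consumer-side aggregation data product.
# All source names and fields are hypothetical assumptions.

from collections import defaultdict

def build_customer_transaction_product(sources):
    """Aggregate transactions from several source data products by customer.

    `sources` maps a source data product name to its published records,
    e.g. {"payments": [...], "orders": [...]}.
    """
    totals = defaultdict(float)
    for source_name, records in sources.items():
        for record in records:
            # Each source is assumed to publish customer_id and amount
            # as part of its data contract.
            totals[record["customer_id"]] += record["amount"]
    # The aggregated view is itself published as a new data product.
    return [{"customer_id": cid, "total": total}
            for cid, total in sorted(totals.items())]

sources = {
    "payments": [{"customer_id": "c1", "amount": 10.0},
                 {"customer_id": "c2", "amount": 5.0}],
    "orders":   [{"customer_id": "c1", "amount": 20.0}],
}
print(build_customer_transaction_product(sources))
# [{'customer_id': 'c1', 'total': 30.0}, {'customer_id': 'c2', 'total': 5.0}]
```

The point of the pattern is that the consuming team owns this pipeline; only if other teams start depending on its output does the ownership question from the answer above arise.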
[00:40:11] As far as I understood, there is data replication between what is produced and what is consumed when the data is transferred from one system to another. Can you estimate the overall cost of that? Because if the data is replicated almost everywhere, that sounds expensive.
[00:40:35] So, the first answer is: storage is cheap. But the second is that in a data lake you also have multiple versions of your data. So I'm not really sure that a data mesh actually duplicates more data. I would say that in a data mesh it is much clearer who is using which data from whom, and there is local ownership and oversight of the actual data product architecture, and cost responsibility. Because who is responsible for the cost of a data lake? The central data lake organization, not the consumer and not the producer. But with ownership and responsibility for data products, you also put the responsibility for cost into the hands of the consumers and the producers. This is why, usually, the data redundancy is less than in a data lake. But it depends on the culture of the actual organization.
[00:41:35] Okay, thanks.
[00:41:45] You showed one slide with the notions of addressability, trust, discoverability, interoperability, and the ones I forgot. Yeah, thanks. Is the notion of a data contract at the border of a data product part of the answer?
[00:42:12] Yes, it is. You clearly need to have a data contract in a written form and in a technical form, so that it's obvious for everyone in the organization.
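A data contract in "technical form" can be as simple as a machine-checkable schema that every published record must satisfy. The sketch below is a hypothetical illustration, not any published standard: the contract fields, the owner label, and the validation rules are all assumptions for this example.

```python
# A minimal sketch of a data contract as a machine-checkable schema.
# The contract contents here are hypothetical.

CONTRACT = {
    "name": "customer_transactions",
    "version": "1.0.0",
    "owner": "payments-team",   # whom consumers contact about changes
    "fields": {                 # field name -> required Python type
        "customer_id": str,
        "amount": float,
        "currency": str,
    },
}

def validate(record, contract=CONTRACT):
    """Return a list of contract violations for one record (empty = valid)."""
    violations = []
    for field, expected_type in contract["fields"].items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"wrong type for {field}: {type(record[field]).__name__}")
    return violations

print(validate({"customer_id": "c1", "amount": 9.99, "currency": "EUR"}))
# []
print(validate({"customer_id": "c1", "amount": "9.99"}))
# ['wrong type for amount: str', 'missing field: currency']
```

The written form of the contract (ownership, semantics, SLAs) lives alongside this; the technical form is what lets the platform enforce it automatically.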
[00:42:23] Can we consider that there is a de facto standard for data contracts, like the one PayPal published, or is it too early?
[00:42:37] I think it's too early.
[00:42:54] I have a question regarding data quality. Who is responsible for the data quality? We would say the producer, because we are discussing the contract between the consumer and the producer, so that is who has to deliver the quality. But if you have a chain of responsibilities between different producers and consumers, how do you monitor the quality end to end?
[00:43:30] I'm not sure you can really monitor the quality end to end, but it's clear that quality starts at the source. The owners of the transactional systems and services that now build data products, making transactional data available as an analytical product for consumers, have to take care of the quality of that data product. And if they change something, or if something changes in the business process and in the transactional data, they have to make sure the quality of the analytical data product stays in sync with all the changes in the transactional world. So if you get a user story and you have to change something in your microservice, you have to change your analytical product as part of the same story. As a consequence, you need to inform the consumers of your data product about the change, and hopefully you make a change that does not break all the consumers, like with other APIs and services. But nevertheless, if a data consumer creates a report out of that data, they also have to ensure that the data they use in the report is correct.
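The producer-side duty described here, changing the analytical product in the same story and warning consumers, can be backed by a simple compatibility check run before publishing a new schema version. This is a hypothetical sketch; the schema representation (field name mapped to a type name) and the compatibility rules are assumptions for illustration.

```python
# A minimal sketch of a producer-side check: before publishing a changed
# data product schema, list the changes that would break consumers.
# Schema representation and rules are hypothetical.

def breaking_changes(old_fields, new_fields):
    """Compare two schema versions (field name -> type name).

    Removing a field or changing its type is treated as breaking;
    adding a new field is treated as backwards compatible.
    """
    breaks = []
    for field, old_type in old_fields.items():
        if field not in new_fields:
            breaks.append(f"removed: {field}")
        elif new_fields[field] != old_type:
            breaks.append(
                f"type changed: {field} ({old_type} -> {new_fields[field]})")
    return breaks

v1 = {"customer_id": "string", "amount": "float"}
v2 = {"customer_id": "string", "amount": "decimal", "currency": "string"}

print(breaking_changes(v1, v2))
# ['type changed: amount (float -> decimal)']
```

A non-empty result is the producer's signal to notify consumers, or to publish the change as a new major version instead of breaking the existing contract silently.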
[00:44:59] You presented a slide with the limitations of data lakes. What kinds of limitations do you see in data mesh?
[00:45:09] So, when microservices came out, Martin Fowler said "you must be this tall" to use microservices. It was a metaphor from the roller coasters in US leisure parks, where you need to be a certain height to ride the dangerous roller coaster. The same is true with data mesh.
[00:45:33] You need a certain maturity in your organization to be able to run such a distributed setup, with distributed responsibility, distributed communication channels, alignment on technology, and freedom and autonomy. But, and this is for me the beauty of data mesh, as a community, as an industry, we have learned a lot in the last ten years. Microservices are, depending on how you count, one or one and a half decades old, and we have learned a lot about how to run distributed and decentralized organizations. All of those learnings apply to data mesh. So it's not as new, as scary, as many organizations think. But yes, it's not for starters.
[00:46:33] Maybe me, then. I have a question, the last question. You're talking about a data product owner. On my side, I'm mostly working in classical software development teams, with backend, frontend, and a normal product owner, and most of the time the data people are somewhere else in the company. When you talk about the data product owner, does it mean they come to our team, so that I have a team with a normal product owner and a data product owner? Or are they another team, but within my domain? How does it work exactly?
[00:47:09] Your team gets two new team members, and they're full-time team members: one is a data product owner, and one is a data engineer. This data engineer builds data products at the beginning, but also trains you, so that on a one or two year horizon, just like a full stack developer is able to develop frontend and backend, you are also able to handle most of the data use cases in your limited scope and domain.
[00:47:43] Does it mean as well that the people in the team should start learning some data stuff, so that there is not only one person who is an expert on data?
[00:47:54] Exactly.
[00:47:55] So we want DevSecOps-plus-data people, etc., etc.?
[00:47:59] Yes, yes, yes. Your T-shape is getting broader, and broader and broader. And this is why the platform needs to provide tooling, and why, and it's not just because I'm from AWS, cloud-native services and managed services matter: they reduce the cognitive load of just managing infrastructure, so that you can really take care of the business aspects of transactional use cases and of data use cases. You should spend less and less effort on pure technology.
[00:48:42] Are there existing tools that already support data mesh?
[00:48:47] Yes, all tools that are used for data lakes can be used for data mesh as well.
[00:48:53] There's nothing specific; it's the same technology, but applied differently, distributed.
[00:49:00] And don't buy a data mesh.
[00:49:04] And have you seen this working outside of AWS?
[00:49:09] Pardon?
[00:49:10] Have you seen this working outside of AWS?
[00:49:12] Yeah, yeah, yeah. You see this with Azure and GCP as well. You can also implement it on premises, but it's much harder for the data producers and consumers, with managing and handling open source tools and everything, so the cognitive load and the learning curve are much higher.
[00:49:33] And what were the main difficulties you met when you tried to move to this kind of setup?
[00:49:40] Yeah, people don't want to learn. "I'm a frontend developer, I don't want to care about data. I'm a backend developer, I don't want to care about data. I'm a database developer, I don't want to care about data." It's adapting to change. And it's always a balance in your operating model: for sure you make it a bit harder for the teams in the aspect of building certain things, but from an overall, organizational perspective, it's much easier for the organization to leverage the data. But I didn't want to bash data lakes; there are a lot of data lake implementations that support the needs of their organizations very well. Both are valid architectural and organizational patterns, and each has strengths and weaknesses. But I'm a bit of a fan of data mesh.
[00:50:40] Good. Thank you very much. Thank you, guys.