Gayathri Thiyagarajan
Transcript
[00:00:06]
Hello everyone. Welcome to my talk on assuring data quality at scale. This will be a study of data mesh in practice. Before we begin, I want to thank Flowcon for giving me this opportunity and inviting me to talk about this topic today. I'll be covering a quick primer on what data quality is and why I'm going to be talking about that in particular, then moving on to what data mesh is, the challenges it introduces, and the opportunities and challenges when you're doing it at scale. I am Gayathri Thiyagarajan, a senior engineering manager at Expedia Group, based in the UK. I'm a public speaker and have given many talks on domain-driven design principles. I've done a few DDD projects since 2015, while I was working for another consultancy based in the UK. I joined Expedia about four years ago, and since then I've been managing many data products, where I've had an opportunity to do event storming and, unexpectedly, apply DDD principles again. By extension, that was about the time data mesh as a concept started to evolve, so I had an opportunity to practice that as well, as Expedia grew as an organization. I'll be covering more about that in a bit. I've also published blogs, articles, and field stories. I'm a huge DDD evangelist and don't need an excuse to practice it, given an opportunity.
[00:01:50]
I thought this slide might not make a lot of sense to this audience, but having seen that there are quite a few DDD talks in the schedule, I thought I'd leave it on. Given that I'm going to be talking a lot about data and data engineering, I need to put a warning on the language and semantics I'm going to be using. Here, when I say pipeline, I don't mean the CI/CD pipeline we use to ship code; I'm referring to data engineering pipelines and data pipelines. When I'm talking about metrics, I'm not talking about system or application metrics, but something very similar in the context of data, where we profile data; more about that later. Likewise, checks are not code quality checks but data quality checks. And finally, application is just application everywhere.
[00:02:48]
So, how did this start?
[00:02:51]
So Expedia, if you don't know, is a combination of multiple brands; some of the famous ones are Brand Expedia, Hotels.com, and Vrbo, which is the vacation rental arm of Expedia Group. Previously they all used to operate fairly independently, but recently Expedia went through a transformation at the organizational level, where we started to group products not based on brands but aligned according to their domains. This was mainly to remove the duplication that brand-level alignment had introduced, but we also wanted to take that opportunity to optimize our stack and bring consistency to our runtime platform and infrastructure, all the way through from how the customers interact to how our applications run. While we were doing this, it naturally made sense to extend it to data, which up until that time had pretty much been an afterthought. I've had a lot of experience where we were scraping data on the way through, from browser requests being sent to the backend, with the teams giving no thought to how the data was going to be applied or used. But as we started to align the data products by domains as well, that started to bring in a mind shift where the teams started to think about data as part of their product teams. This is pretty much what the data mesh concept is about; we'll see that soon. This was about three years ago, around the time Zhamak Dehghani from Thoughtworks wrote a blog post about data mesh architecture on Martin Fowler's website.
[00:04:54]
This aligned pretty much with what we were trying to do. So we started to adopt some of the principles that she laid out in that article by building centralized data platform capabilities for streaming, data lake, data storage, and so on.
[00:05:14]
But as we did this alignment by domains rather than brands, suddenly the scale at which we were dealing with the data and the products became much bigger, at least three times. So we had to scale our capabilities accordingly as well. In this backdrop, a gap was emerging: we didn't have a capability which assessed the quality of the data that was being captured, particularly at that scale. We were churning out a lot of data products, migrating and shipping the products that were already there, but we had no idea what the quality of the data was or how reliable it was.
[00:06:00]
So, briefly, what is data quality? What do I refer to when I use that term?
[00:06:09]
As you know, in more recent times particularly, data is the lifeblood of any data-driven company, and pretty much all companies want to be data-driven. What does that mean? Data-driven organizations are those where data is key to every decision and every part of the organization. It is used to get insights and make business decisions. So it's pretty important that that data is reliable, dependable, and trustworthy. Yet you would be surprised to hear that a lot of organizations don't have a central tool by which they can monitor the data they are actually collecting, Expedia included, until recently.
[00:06:56]
So there is no guarantee or validation done on that. People just blindly rely on what was captured and use what they can actually get hold of, which is better than nothing.
[00:07:10]
On the engineers' side, particularly data engineers, if you think about what the impact of bad data quality could be: you would have broken data pipelines.
[00:07:20]
You get woken up in the middle of the night with a P1 issue, and you'll have to trace all the way back to the source of the data to identify what caused the pipeline to break. And then, once you have identified it, you'll have to reprocess and clean the data to make it usable. This was not a small task.
[00:07:46]
Therefore a capability, or in some cases multiple capabilities, was needed to provide reliable and trustworthy data.
[00:07:57]
High-quality data, without a doubt, was important for any organization which needed insightful and dependable data analytics and wanted to derive meaningful data-driven decisions.
[00:08:14]
So therefore you need a tool or capability that measures and gives you an indication of what that quality is, right? You need something to collect metrics that give you an evaluation of how good the data is and what you can do if it is not good enough.
[00:08:33]
This is what Martin Fowler has to say about this particular topic.
[00:08:41]
In any medium or large-scale enterprise, you typically move billions of events and hundreds of terabytes of data through Kafka streams or data pipelines, landing in data lakes, data warehouses, you name it. There's data flowing somewhere at any point in time. Imagine running the distributed systems and microservices we talk about so much these days at scale, but without any operational metrics that monitor their health. That's a pretty scary thought at this age, right? So why should we be running hundreds of data pipelines without any metrics on them to say whether they are running well, whether they are breaking and, if so, why, or at least some kind of alerting or monitoring on them? So there is a need to do this centrally, which is indisputable, but also at scale.
[00:09:42]
And this is also important because you need to provide transparency and trust in the metrics being calculated on this data, because the data that is owned in one place is going to be consumed by a different part of the organization. So you need this to be done centrally and in a way that is visible to all the parties concerned.
[00:10:05]
How do these quality issues typically tend to manifest? Let's see a few examples, which I grabbed from our production systems and which usually pop up in our support channel.
[00:10:20]
I don't know if it's visible enough. I'll just rush through these.
[00:10:29]
So, if you can't see them, these are some of the issues that have come up particularly in our clickstream data. Clickstream is actually a really good example of the impact of bad quality data. Clickstream is traditionally captured by multiple teams, so you get varying quality of data across those teams, and it's also consumed by a multitude of teams, be it machine learning for personalization, real-time or offline analytics, marketing, you name it; clickstream is used somewhere in your organization.
[00:11:14]
So naturally you can expect all kinds of issues, and if something happens, there are loads of screams coming out of that; usually the team which owns that data, which is the clickstream team, gets shouted at. But these are just root causes, right? The consequence is not immediately evident. You could have anything ranging from disruptions to outright outages of the data pipelines. Sometimes there are features built on top of this data. We have a feature called recent searches, which relies on a customer doing a search and that data being available in near real time, so that when you go back to the homepage you see what you searched for. You can imagine that if something goes wrong with that data, the customers don't see it. So it directly manifests as poor customer experience. Or predictions are incorrect or irrelevant, and so on and so forth.
[00:12:15]
But at the same time, it's important to also differentiate quality issues from legitimate data drift.
[00:12:24]
Because obviously customer patterns change. COVID is a good example, or if there is a conference just like this one and people are flocking towards it, obviously you see a spike or a change in customer patterns. So you need to know when that's happening versus when the data is genuinely anomalous. Broadly, these quality issues fall under a few categories.
[00:12:56]
The first one is incomplete data. All the data pipelines and streaming applications expect a certain amount of completeness in the data in order for it to be called usable.
[00:13:13]
Incomplete data means these data processes cannot actually process that data, so they'll have to skip over it or completely stop the pipeline when the amount of unusable data exceeds a certain threshold. As a result, you'll have to reprocess, and we're talking about millions and millions of records, particularly with batch processing, where you normally get a day's worth of data coming in for marketing purposes and from third parties. So you end up reprocessing that data once you have identified and fixed the issue or cleaned up the data. This obviously means a lot of compute power and a delay in the final data being available for use. If you think about streams, which are about real-time availability of data, reprocessing is not a simple task at all; in most cases it's near impossible to reprocess the data once it's gone through your Kafka stream. And if you take certain domains like finance or tax, even a small amount of incomplete data is just not acceptable, because you wouldn't have any kind of reconciliation possible with that data.
[00:14:31]
The second classification or bucket is incorrect data. Normally this means the data violates a certain expected pattern, and it can lead to incorrect handling by the processes if they don't know what incorrect or invalid data looks like. This can be directly reflected in machine learning training: if a model gets trained with incorrect data, the prediction or personalization features on your website will be incorrect.
[00:15:08]
The data is also used for things like product performance: unique visits per user, unique visitors, visit counts, and so on. All of those metrics would be unreliable if the data is invalid.
[00:15:30]
Late data, I'm sure a lot of people can immediately relate to this. The data is not available in time for your business case or purpose. This could happen if your upstream pipeline is disrupted or broken for various reasons, there's not enough complete data available, or the pipeline has been stopped for other purposes and has to be reprocessed, and so on. With a stream, it means applications that depend on that stream data in real time can miss SLAs and SLOs through no fault of their own.
[00:16:13]
The final bucket here is where there is a difference between source and destination. This directly relates to data loss in transit, and again, in a lot of organizations there is no way to measure this directly if you don't have a dedicated capability which does it.
[00:16:37]
So, how do we measure data quality? All of this builds up to how we do this at scale as a central capability, as data mesh recommends that we do. But before we look into the actual implementation of a solution, we need to understand the data quality concepts.
[00:16:56]
So, data quality is normally associated with the "fitness for use" principle, which refers to the subjectivity and context dependency of this topic. What is good quality data for your use case may not be sufficient for another application.
[00:17:15]
It's also typically a multidimensional concept, so you have various dimensions which represent what good quality data is, such as accuracy or timeliness, which can be directly related to some of the buckets we just saw.
[00:17:30]
And a data quality metric is a function which maps these quality dimensions, for example timeliness, to a metric, for example one calculating latency. So usually the dimension is a function of that metric. And you can aggregate these data quality metrics at various levels, so you could have them at the column or attribute level, or at the data set level, depending on what dimension you're calculating and for what purpose.
[00:18:05]
I'll rush through these. I don't think the mathematical formulas are going to be super interesting, but if you want to take a look, the slides will be shared afterwards. So, completeness, which is directly related to the issues about incomplete data: this is one of the key dimensions of data quality. It generally describes the breadth, depth, and scope of information contained in the data. It also covers the condition that the data must exist in order to be complete, which means null values or completely missing rows would also be attributed to this dimension.
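As a rough illustration only, not the exact formula from the slides, completeness can be expressed as the ratio of populated values to expected values; the column values and row counts below are made up.

from typing import Optional, Sequence

def column_completeness(values: Sequence[Optional[object]]) -> float:
    """Fraction of non-null values in a column; 1.0 means fully populated."""
    if not values:
        return 0.0
    populated = sum(1 for v in values if v is not None)
    return populated / len(values)

def dataset_completeness(expected_rows: int, actual_rows: int) -> float:
    """Fraction of expected rows that actually arrived (missing rows count against it)."""
    if expected_rows == 0:
        return 1.0
    return min(actual_rows / expected_rows, 1.0)

# Example: a column with two nulls out of five values
print(column_completeness(["GB", None, "US", None, "FR"]))                  # 0.6
print(dataset_completeness(expected_rows=1_000_000, actual_rows=987_500))   # 0.9875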
[00:18:46]
The second one is accuracy, which describes the closeness between the information system which captures and stores the data and what the real world represents. This is an important data quality dimension; obviously, everyone wants accurate data for their purpose, but it's also quite difficult, because how do you objectively quantify what's out there in the real world? Sometimes you use a metric that is close enough to what is in the real world to calculate accuracy. So accuracy is not always accurate.
[00:19:26]
So there is no objective way of measuring that directly. For example, cross-data-set correlation between your clickstream data and your order system: the number of clicks on the order button, or successful page displays post-booking, should match the number of orders entered into the system, or cancellations, vice versa. So that's a good way to compare two different data sets and calculate how accurate one is relative to the other.
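A minimal sketch of that cross-data-set reconciliation idea, with made-up counts and an arbitrary tolerance; this is illustrative, not the actual check used at Expedia.

def reconciliation_accuracy(clickstream_bookings: int, order_system_bookings: int) -> float:
    """Accuracy proxy: how closely booking events in clickstream match recorded orders."""
    if order_system_bookings == 0:
        return 1.0 if clickstream_bookings == 0 else 0.0
    return 1.0 - abs(clickstream_bookings - order_system_bookings) / order_system_bookings

accuracy = reconciliation_accuracy(clickstream_bookings=10_050, order_system_bookings=10_000)
print(f"accuracy proxy: {accuracy:.3f}")                                   # 0.995
print("PASS" if accuracy >= 0.99 else "FAIL: investigate loss or duplication")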
[00:20:01]
The third one is consistency. This captures violations of semantic rules defined over the data items; these could be tuples or records in a file. Referential integrity is a good example. These are normally rules that require some domain knowledge, and violations of these rules are captured by this dimension.
[00:20:31]
The last one is timeliness. As mentioned earlier, this is a function of latency as a metric, for example: how current is the data that we have and are using for our purpose? It's also related to the notion of volatility, which is how fast this data becomes irrelevant or stale.
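A toy sketch of timeliness as a function of latency and a volatility window, assuming each record carries an event timestamp; the window length is arbitrary.

from datetime import datetime, timezone

def timeliness(event_time: datetime, observed_at: datetime, volatility_seconds: float) -> float:
    """Timeliness score in [0, 1]: 1.0 for brand-new data, 0.0 once it is older than
    the volatility window (i.e. considered stale for this use case)."""
    latency = (observed_at - event_time).total_seconds()
    return max(0.0, 1.0 - latency / volatility_seconds)

event = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# With a 1-hour volatility window, data observed 30 minutes later scores 0.5
print(timeliness(event, event.replace(minute=30), volatility_seconds=3600))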
[00:20:57]
There are other data quality dimensions as well, depending on which literature you refer to; freshness and validity are included in some. But for general purposes, these four normally suffice.
[00:21:15]
So before we move on to the concept of data mesh, I want to take a moment to impress upon you the complexity here, particularly when describing data quality, because I referred to this briefly: data quality metrics have both subjective and objective qualities associated with them. What does that mean? The dimensions that we saw, completeness, accuracy, timeliness, and consistency, are called hard dimensions. These are objective quality measures. They don't tend to change depending on who is using the data or observing these dimensions, and they can be calculated with very little domain knowledge, apart from a few dimensions. But there are soft dimensions as well, which require subjective evaluation. They are very much context dependent and need a lot of domain knowledge to evaluate. For example, if you are building an application which makes use of a SQL data store, you might want to calculate the quality of the data based on a certain column being complete, or a certain column being present conditional on another column. That is very specific to your application's needs and may not define the quality of the data for a different team, domain, or product which might be using the same data store. Duplication is another example; it can go either way, so the criteria for what constitutes a duplicate might be different for another application looking at the same data.
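To make the soft-dimension point concrete, here is a hypothetical consumer-specific rule, where one column must be populated only when another column has a particular value; the field names and rule are invented for illustration and would be meaningless to another consumer of the same table.

def conditional_completeness(rows: list[dict]) -> float:
    """Team-specific rule: 'cancellation_reason' must be present whenever
    status == 'CANCELLED'. Other consumers of the same table may not care."""
    relevant = [r for r in rows if r.get("status") == "CANCELLED"]
    if not relevant:
        return 1.0
    satisfied = sum(1 for r in relevant if r.get("cancellation_reason"))
    return satisfied / len(relevant)

rows = [
    {"status": "CANCELLED", "cancellation_reason": "customer request"},
    {"status": "CANCELLED", "cancellation_reason": None},
    {"status": "BOOKED"},
]
print(conditional_completeness(rows))  # 0.5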
[00:23:12]
So this diagram hopefully explains that concept easily. For any data set, there are usually
[00:23:20]
two, if not three, different actors associated with that data. First, on the left side, are the producers who generate the data. This could be more than one team, external producers, third parties, you name it; anybody who's producing this data and bringing it into your organization. They normally agree on a contract, called a schema in most cases, and their responsibility is to produce data which abides by that contract and to send the data at an expected frequency.
[00:23:58]
And then on the extreme right are the consumers, who build applications and processing pipelines
[00:24:05]
which use the data, or who simply look at a data table and perform analysis frequently.
[00:24:13]
This group of people is normally on the receiving end of any data quality issues, and in a traditional setup they have to constantly chase the data products, producers, or owners to find out why a certain analysis doesn't quite look right, or why certain data needs more cleanup before it can be used to train a model, and so on.
[00:24:35]
And finally, in the middle are the data owners, who own, maintain, and support the data. You may not always have data producers separate from data owners.
[00:24:47]
But where you have a multitude of producers and owners, this normally requires a central team which owns that data and can assure the data is constantly available; if anything goes wrong, that team looks after it, and so on.
[00:25:05]
So as you can see, depending on who you are and what your purpose is, the subjective quality changes. For a producer, as long as you're matching the schema, ticking all the mandatory and optional fields, and sending the data at the expected frequency, you have passed your quality checks. But for a consumer, this needs to be diligently followed, adhered to, and assured. And for the owners as well, because they have SLAs and SLOs to meet, which cannot be missed.
[00:25:45]
Having covered what data quality is, let's take that concept forward.
[00:25:51]
It started with us identifying a need for a central capability which gives that assurance for the data sets we had at Expedia at that time. Data architecture has evolved over time; we've had data lakes come in, data warehouses, and so on. But in the last couple of years, one particular concept has captured a lot of interest and also radically changed how data is perceived by organizations. With this concept gaining more and more traction, data is now being thought of as a first-class citizen rather than an afterthought. So what is it?
[00:26:44]
When Zhamak wrote the blog post, she described data mesh as an architecture. But if you look at her book or the recent editions of the blog post, she calls it a socio-technical approach: a decentralized socio-technical approach for sourcing, managing, and accessing data.
[00:27:07]
Mainly for analytical purposes, but you can use it for any use case, at scale.
[00:27:15]
So what does data mesh recommend? It recommends well-defined boundaries; those of you who are familiar with domain-driven design principles, particularly bounded contexts, can relate to that. You have well-defined boundaries around your products, and this extends them to include data products as well, where before the data used to exist outside. So you encapsulate the apps and the data they produce into boundaries, and the product owners who own these domain products own the data products and the data they produce and consume as well. Along the way, as this data mesh principle was practiced more and more and turned into an application of that architecture within enterprises, it unearthed a few challenges and gaps that existed at the time: the lack of ability to discover, understand, and, more importantly, trust and use good quality data.
[00:28:22]
So what was the objective of data mesh? The emphasis here is doing this at scale, because as you decentralize the data products and put them within each of these domains, scale suddenly becomes a problem. How do you decentralize the data, but provide a more centralized capability, or collection or suite of capabilities, that can process and handle this data, taking it from the producers on the left side of that diagram all the way through to the consumers on the right? So scale here covers a few things. One is change in the data landscape: the landscape has evolved to include so many more frameworks and tools, where before we didn't even think about streaming. There is more and more adoption of Kafka streams these days, and Pulsar and other streaming technologies are coming up as well. There is a proliferation of producers and consumers of data with IoT, and data is almost gold dust at the moment: you want to capture as much as you can use, or even more. There is also diversity in transformation and processing, and speed of response to change. You want organizations to react to changes in data and changes in customer patterns immediately and effectively.
[00:29:55]
So any data mesh implementation, as we wanted to do at Expedia, should embody these principles. Remember, we were building this at scale, and that's typically what we were looking at: thrice the scale of where we started. It should deliver the quality and integrity guarantees needed to make that data usable. It should be decentralized: domain ownership shifts to the product owners, data as a product, a self-serve data platform, and federated governance capabilities. Because as you decentralize, it's not just one person who owns the data or is responsible for it; you have a collection of owners who need to be made accountable for how they process PII and PCI information and so on.
[00:30:53]
So tying this back to the data quality concept we saw, where does that fit in with the data mesh architecture?
[00:31:03]
So ownership of data quality in particular has shifted to the left, to be closer to the producers of the data. Domain products and their product owners own the data, as well as giving assurance for the data quality and being responsible for it. And it advocates central platform capabilities that work at scale.
[00:31:26]
For not just one type of data source; as we just mentioned, there has been a change in how data is stored these days: you have NoSQL data stores, SQL, so many streaming technologies, data lakes on the cloud and other cloud vendors, and so on. So your tool should be able to support all of these stacks, the subjective and the objective measures of data quality, and a well-matured, transparent way of measuring them so that they can be monitored centrally.
[00:32:06]
So what does it mean to build such a capability at scale?
[00:32:13]
So data quality is a huge space. Normally when you talk about data quality, as happened last night when somebody asked me about governance, a lot of concepts start to come up as you start talking about these metrics and dimensions, and they all have some level of commonality, so you would think data quality should cover all of it. But as you set out to build at scale, you don't want even more complexity than you already have to deal with. So these solutions normally tend to cover monitoring: how do you measure these metrics and then translate them into dimensions? How do you notify about data quality issues on time and in near real time? Lineage: data pipelines in particular are chained, and you want to trace all the way from the source, through all the transformations that happened in between, to where your application starts to consume.
[00:33:16]
Obviously auto-remediation is a much desired outcome: when you spot a data quality issue, you don't just want to be alerted, you want it fixed, and fixed quickly. Root cause analysis, KPI metrics, and system quality metrics can be thrown into that mix as well. But whatever data quality means for you, just make sure that when you're solving for this at scale, the scope of it is well defined.
[00:33:56]
So, taking those requirements, and probably dismissing some of them as out of scope, what should an effective data quality monitoring tool contain as a central capability? First and foremost, a standardized definition for your organization of what those dimensions and metrics mean. You're no longer dealing with individual teams or products here, where you can make up your own definition and share it with just your consumers; you need a standardized way of defining data quality because you're dealing with all the domains across your org.
[00:34:40]
Support for different sources of data. Obviously near real time, because if there is an issue, you want to know that almost as soon as you receive that data.
[00:34:53]
Alerting and notification, be it through Slack, PagerDuty, or whatever incident management system your team uses; that needs to be integrated at the team level. Measuring and making this available centrally and in a transparent way. You could also use scoring at the data set level: you can make a function out of the data quality dimensions and calculate a score that then shows, and hopefully encourages, better practices for how this data is captured and stored within the org. Monitoring and tracking trends over time, and a simple and intuitive customer experience.
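One way, purely illustrative, to roll the dimensions into a single data set level score and a red/amber/green rating as just described; the weights and thresholds here are arbitrary, not the ones Expedia used.

def quality_score(dimensions: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of dimension scores, each expected to be in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(dimensions[d] * w for d, w in weights.items()) / total_weight

def rating(score: float) -> str:
    """Map a score onto a simple red/amber/green usability indicator."""
    if score >= 0.95:
        return "GREEN"
    if score >= 0.80:
        return "AMBER"
    return "RED"

dims = {"completeness": 0.99, "accuracy": 0.97, "consistency": 0.92, "timeliness": 0.85}
weights = {"completeness": 0.4, "accuracy": 0.3, "consistency": 0.2, "timeliness": 0.1}
score = quality_score(dims, weights)
print(round(score, 3), rating(score))  # 0.956 GREEN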
[00:35:39]
There are some offerings already out there to do data quality, so we looked at those before we set out to build this centrally. But this also highlights the maturity of data mesh as it stands today. Even though as a concept it's brilliant, it makes sense, it's almost intuitive, the maturity of the available tooling is just not there. For example, there are many open source as well as licensed options that allow you to do this, but they all carry limitations that may or may not be acceptable for your org. Still, it's always a good idea to look at what's there before you start building your own.
[00:36:27]
But before we started building our own, obviously we engaged with our stakeholders to see what they had in place already. There were many homegrown options; not everyone was as blind as some of the others.
[00:36:45]
But the options ranged from rudimentary, where teams had checks baked into their code or sampled data and then ran checks, to more sophisticated but very much tailored solutions for their own team, which could not scale and were not suitable for anybody outside that team. They didn't have the capacity or bandwidth to keep those well maintained and supported, because it was not part of their core product offering, and so on. Some teams just had no idea there were any issues with their data pipelines until it happened, or a couple of days after it happened and someone else was screaming at them. So none of these was ideal to roll out at the scale we were looking at. So I just want to quickly give the approach we used, for those of you who are interested in how we went about solving this problem.
[00:37:53]
So the actual platform capability that we built had four stages. The first one was ingesting the data, ingestion. The second one was profiling. The third one was the checks, the actual process that ran the checks on the data, and this was across all the data sources, everything. And the last one was the notification. Underpinning all of this was the configuration that we captured from our customers. If you remember, the subjective analysis requires knowledge and input from the various domain teams to tell us what they want in terms of the checks that could be run on a data set that interests them.
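The per-data-set configuration captured from domain teams could look roughly like this; the shape, dataset name, topic, and channel below are hypothetical, not our actual schema, but they show how one config ties together the source to ingest, the profile metrics to compute, the checks to run, and where to notify.

# Hypothetical shape of a team-supplied data quality configuration
check_config = {
    "dataset": "clickstream.page_views",                 # illustrative dataset name
    "source": {"type": "kafka", "topic": "page-views"},  # drives the ingestion stage
    "profile": ["row_count", "null_counts", "field_cardinality", "latency"],
    "checks": [
        {"metric": "row_count", "type": "threshold", "min": 1_000_000},
        {"metric": "null_counts.user_id", "type": "threshold", "max_ratio": 0.01},
        {"metric": "row_count", "type": "anomaly_detection"},
    ],
    "notify": {"slack_channel": "#clickstream-data-quality"},
}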
[00:38:37]
So what does ingest mean, and how did we achieve this at scale? Simply by not building everything on our own. We in-sourced the components for various data stores and other technology stacks as and when the need arose. We built the plugins for ingesting streams and data lake, which were the most widely used data sets in our org. But for Mongo or Cassandra, we didn't go about building all those ingestion adapters ourselves; rather, we invited the teams to contribute, so that we didn't get bogged down by all of the technology stacks we needed to support, while at the same time providing that diversity of options for our customers.
[00:39:31]
So we ended up using this technology stack to support the ingestion of the data.
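The plugin idea can be sketched as a small adapter interface that contributing teams implement for their own stores; this is a simplification with invented names, not our actual API.

from abc import ABC, abstractmethod
from typing import Iterator

class IngestionAdapter(ABC):
    """Contract a contributing team implements for their data store."""

    @abstractmethod
    def read_batch(self, dataset: str) -> Iterator[dict]:
        """Yield raw records for the named dataset."""

class KafkaStreamAdapter(IngestionAdapter):
    def read_batch(self, dataset: str) -> Iterator[dict]:
        # In reality this would consume from a topic; stubbed here.
        yield {"dataset": dataset, "payload": {}}

# The core platform depends only on the interface, so Mongo, Cassandra, etc.
# can be contributed later without changing the core.
ADAPTERS = {"kafka": KafkaStreamAdapter()}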
[00:39:43]
So the next one is profiling. To me this is the foundational part of how you can run a data quality offering at scale. It's not by running these checks directly on the raw data set, as some of the teams were doing, which was fine at their team level but impossible for us to scale, because we would have needed to know a lot more about their data than we wanted to. Profiling means you collect metrics on the raw data that are relevant for identifying data quality issues. For example, we were collecting metrics like the number of rows the data had, the number of fields missing within the data set or at the row level, how many fields were populated, how many nulls there were, histograms, cardinality of a particular field, latency, distribution, the size of the data, and also some statistical distributions on each field: the min, max, median, mode, mean, standard deviation, and so on. We calculated as many metrics as we possibly could. One really good side effect of this process was that it was fairly independent of our purpose, which was to be able to measure data quality. You can use this profiling for exploratory data analysis, which a lot of data scientists and machine learning engineers do these days before they set about using a particular data set for their modeling. And you can also use it to calculate a data quality score, or derive metrics from it; for example, you can take a ratio of such metrics to calculate a duplication ratio, and so on. The possibilities of how you can use this are endless.
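A toy version of that profiling stage over an in-memory batch; the real thing ran over streams and data lake partitions, but these are the same kinds of metrics just listed. The field name is made up.

import statistics
from collections import Counter

def profile(rows: list[dict], field: str) -> dict:
    """Profile one field of a batch: counts, nulls, cardinality, basic distribution."""
    values = [r.get(field) for r in rows]
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    metrics = {
        "row_count": len(rows),
        "null_count": len(values) - len(non_null),
        "cardinality": len(set(non_null)),
        "histogram": dict(Counter(non_null)),
    }
    if len(numeric) >= 2:
        metrics.update(
            min=min(numeric), max=max(numeric),
            mean=statistics.mean(numeric),
            median=statistics.median(numeric),
            stdev=statistics.stdev(numeric),
        )
    return metrics

print(profile([{"price": 100}, {"price": 120}, {"price": None}], "price"))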
[00:41:49]
The next one was checks. All of these stages were completely decoupled and asynchronous, as we used streams for the output and input of every single stage. Checks ranged from running simple threshold checks all the way through to using sophisticated neural networks for anomaly detection on these profile metrics. More on that later; the anomaly detection part was particularly challenging.
[00:42:21]
But what this means is that this is where the subjectivity of data quality as a concept comes in. You need to be able to capture and run several checks on the same data, and sometimes multiple checks for the same field or attribute. So the overhead of doing this at scale is much, much higher. If you calculate the permutations and combinations, you are looking at the data set, row, and field level, times the number of metrics you're calculating, and then you're running multiple checks on the resultant metrics as well.
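The simplest end of that spectrum: threshold rules evaluated against the profile metrics from the previous stage. A hedged sketch with made-up metric names and thresholds, not the production rule engine.

def run_threshold_checks(metrics: dict, rules: list[dict]) -> list[str]:
    """Evaluate simple min/max rules against profile metrics; return failure messages."""
    failures = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is None:
            failures.append(f"{rule['metric']}: metric missing from profile")
        elif "min" in rule and value < rule["min"]:
            failures.append(f"{rule['metric']}={value} below min {rule['min']}")
        elif "max" in rule and value > rule["max"]:
            failures.append(f"{rule['metric']}={value} above max {rule['max']}")
    return failures

metrics = {"row_count": 950_000, "null_count": 12}
rules = [{"metric": "row_count", "min": 1_000_000}, {"metric": "null_count", "max": 100}]
print(run_threshold_checks(metrics, rules))  # only the row_count rule fails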
[00:43:03]
One important thing we did was eat our own dog food. As I mentioned, we had decoupled each of these processes, so we ran checks on our own data quality metrics, the profile metrics we were calculating. That way we would know when any part of our system was broken, but we'd also know how accurate the calculations or the anomaly detection were at predicting any issues.
[00:43:33]
Finally, the notification. As I mentioned, this really depends on what works best for the domain teams. You want to be able to support however these teams are set up to manage incidents and be notified when there is an incident. And you want to be able to report on these incidents as well, along with any remediation that follows and any incident management and root cause analysis that happens after that.
[00:44:07]
So, the last couple of slides here. I just want to summarize the opportunity of doing this at scale. As I mentioned, data mesh as a principle is very obvious and makes a lot of sense, but it comes with the inherent complexity of doing this at scale. There is a reason why Zhamak turned it from an architecture into a decentralized socio-technical approach: it's not just a technical implementation. It needs reorganization of your teams, it needs tooling support, it needs the organization's support to build these capabilities at scale, or at least to find and buy products that can help you implement it. Having said that, well, I do have a slide which talks about challenges; I probably jumped the gun there. The opportunities are much greater. First, it creates transparency and trust at the organization level. I have led data product teams where the analysts and the marketing team were really not sure about the data that was captured even in house, so we had to rely on third-party libraries like Adobe Analytics or Google Analytics to collect reliable data; such was the mistrust in the data that was already in the org or in the data lake. It rewards cleaner data practices, because you can calculate the score, make it visible in your data catalog, and say how usable a data set is. You can use red, green, and amber to highlight the usability of the data as a whole.
[00:46:06]
And it extended and promoted collaboration on the platform. As I mentioned, we invited a lot of contributions, so we didn't build all of this on our own or take it all on ourselves; we just built the core, and everything else was pluggable and extensible. We made the profile metrics available, for example, for data scientists to use in their exploratory data analysis, even if they don't give a monkey's about data quality, which they do. And then, last but not least, there is the reusable metadata I just talked about.
[00:46:44]
So, quickly, some of the challenges I have not covered so far. Again, achieving that trustworthiness at scale is not a small challenge or a small feat. You need the standardization and the definitions, and you need all the stakeholders agreeing on what those mean for them. We spent months coming up with the definitions, in partnership with data governance, so that everyone was happy about what we defined these dimensions to be, which was not quite the same as the industry definitions, for example. So we had to adapt, customize, and make that visible as well. Engineering for different types of ingestion was very much a challenge, because stream versus batch, even though they both handle data at scale, big data, they are completely different beasts. A Kafka stream is an unbounded data set, and how batch data is partitioned, consumed, and even created is very different from how you stream data. So building that even for the most commonly used types of data was not easy; we had to adopt different technical stacks, much against our best judgment, but we had to do it. The third one is anomaly detection. Obviously, when you're building this as a central capability, you're building an anomaly detection tool that has to learn and detect quality issues across multiple data sets. So even though we are capturing profile metrics, there is some level of knowledge these models need about that data which can vary from one data set to another, from the most frequent data to less frequent data, if you're just taking one metric such as the count of events. So the prediction could vary depending on which metrics you have used to train on that data, and we had to wrangle quite a bit to make the models work. It definitely was not a silver bullet you can put on top of all the metrics to run all types of checks, so we had to be very picky about where it would add the most value for our customers. Mainly because, first, it was physically impossible to run all those different models on all the data sets all the time, but secondly, there were some checks that were well known by the customers; they knew exactly what they wanted, so a simple threshold check or a pattern check based on regex would do the job, and we didn't need anomaly detection in those cases at all. It was more where the trend was seasonal that we needed it. And obviously, providing both subjective and objective measures, and finally customizing our product to provide a good customer experience, were further challenges.
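For the seasonal cases where a fixed threshold isn't enough, a crude stand-in for the anomaly detection described above is a rolling z-score over a profile metric such as daily event count; our actual models were considerably more involved, so treat this as an illustration only.

import statistics

def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest daily count if it sits more than z_threshold standard
    deviations away from the recent history."""
    if len(history) < 7:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

daily_counts = [1.02e6, 0.98e6, 1.01e6, 1.05e6, 0.99e6, 1.03e6, 1.00e6]
print(is_anomalous(daily_counts, latest=0.40e6))  # True: likely a broken pipeline
print(is_anomalous(daily_counts, latest=1.04e6))  # False: within normal variation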
[00:49:48]
So this concludes the talk; I'm running out of time, so I won't go into this in much detail. If you're interested in reading a bit more, here are some links to white papers and Zhamak's article. There's a book as well, if you're interested in reading further, and other bits that I gathered and referred to in my presentation.
[00:50:16]
Thank you.