EP 01: The Evolution of Data Lakes
About This Episode
Join Vinoth Chandar, creator of Apache Hudi, as he traces his career journey from distributed systems to building real-time data infrastructure at LinkedIn and Uber. Learn how the need for incremental processing at scale led him to develop the open source Hudi project, bringing stream processing capabilities to data lakes. Learn more about Onehouse at https://www.onehouse.ai/.
Know the Guests
Vinoth Chandar
Chief Executive Officer and Founder of Onehouse
Vinoth Chandar is the Chief Executive Officer and Founder of Onehouse, a data analytics company that provides a platform for building data lakes at scale. Vinoth has around 20 years of experience in the software industry, with a focus on data infrastructure, cloud computing, and artificial intelligence. Prior to founding Onehouse, Vinoth held senior engineering and management positions at several companies, including Oracle, LinkedIn, Uber, and Confluent. At Uber, he was responsible for building the company's data infrastructure from the ground up. Vinoth is a recognized expert in data analytics and has authored several patents in this area. He holds a Bachelor of Engineering degree in Computer Science from the College of Engineering, Guindy.
Know Your Host
David McKenney
Vice President of Public Cloud Products at TierPoint
David McKenney is the Vice President of Public Cloud Products at TierPoint. TierPoint is a leading provider of secure, connected IT platform solutions that power the digital transformation of thousands of clients, from the public to private sectors, from small businesses to Fortune 500 enterprises.
Transcript Table of Contents
- (0:30) Introduction to Vinoth Chandar
- (7:55) Data Science, Databases, and Database Architecture
- (15:13) Uber and Apache Hudi Beginnings
- (22:53) Unstructured Data, Data Lakes and Data Warehouses
- (28:26) Data Lakehouses
- (39:40) Apache Hudi Helps Businesses with Data Warehouses and Data Lakes
- (42:50) How Public Cloud Impacts Data Lakes and Data Warehouses
- (45:52) Origin of Hudi Name
- (55:04) Outro – Getting Started in Data Warehousing and Data Lakes
Transcript
David McKenny: All right, so welcome to this episode of Cloud Currents, where we grab industry experts across the cloud ecosystem and chat about current events and experiences. With me today, we've got Vinoth Chandar, who I'm going to refer to as the “data guy”. So thanks for joining us, Vinoth.
Vinoth Chandar: Thanks for having me on, David. I think I'm flattered by what you're going to call me, but I don't think I'll do justice to that. But happy to be here, and I look forward to the podcast.
(0:30) Introduction to Vinoth Chandar
David McKenny: So today's topic is going to be all about data. We're going to talk about data science, analytics, warehousing, all that great stuff. And to start, we'll just go into a bit of your career journey.
And let's see how I do here. I'm sure you've talked about your career; let's see if I can sum it up. You've got a pretty notable background. Just looking at the last 10 years or so: in 2011 you left a pretty good gig at Oracle and joined LinkedIn, which I had to double-check, but that was the year that LinkedIn went public.
So that had to be a pretty crazy time, a lot of very rapid growth, hyper-growth if you will. Several years later, we hit the 2014 timeframe and you moved over to Uber, which was also in its very early days. It had to be equally crazy, I would imagine. I'm kind of curious which one was crazier for you.
That laid the groundwork for what you were doing with Apache Hudi at that time, which is near and dear to you, and that's a topic for today. Then post-Uber, it looks like you continued that effort on the Apache Hudi project, as well as an effort with Confluent doing some Apache Kafka things, if you will, for streaming data. And that takes you to where you're at today, running a company that you started called Onehouse.
How did I do?
Vinoth Chandar: Pretty accurate, actually. And to quickly answer your question off the bat, I think Uber was way more exciting and way crazier, because I joined much earlier, in October, and when we started we didn't have a data team; we practically built the data infrastructure from the ground up. That said, LinkedIn was a fascinating ride.
I consider that I learned a lot of the things I know today around operating reliable data platforms at LinkedIn, because this was pre-cloud and we were building a key-value store on-prem, in an era with no cloud. You have to really power a high-scale, consumer-facing website, and we were like a three-person team doing that.
So it was a really, really good experience at LinkedIn as well, I would say.
David McKenny: That's awesome. So out of curiosity, was your career always destined to do all things data, or was there a turning point somewhere along the way? Because I feel, personally, like a lot of us go to school for computer science and programming, rooted in traditional things.
So where did you start to see that data was going to be a thing for you?
Vinoth Chandar: Oh, that's a great question. So originally my passion was around distributed systems, not specifically databases, but something a little bit more adjacent, right? And I was interested in all kinds of distributed systems; mobile networking is a distributed system, right?
So is large-scale, high-performance computing; those are distributed systems too. So when I went to grad school at UT Austin, I actually had the chance to work in two of these areas. One was a vehicular content delivery network, or a DTN, as they call it, a delay-tolerant network.
It was around mobile networking and how we do Wi-Fi offloading and things like that. And I also got the chance to work at the Texas Advanced Computing Center, where they run some of the biggest supercomputers in the country to generate weather reports. If anybody remembers MPI and all those parallel processing programs from the older days, I started my journey in all of that.
We built a bash-like shell equivalent which could run distributed programs and do a MapReduce kind of analysis. So I was doing both of these. Honestly, my passion was more towards mobile networking, which was kind of exploding at the time. So I actually wanted to go do more around mobile networking, but then I landed my job at Oracle and I really liked the initial team there.
They were doing database replication, very hard, very fun computer science problems to solve. So I ended up joining Oracle, and that kind of set me on this track to do data pretty much the entire time since then. The one interesting thing I would add is, kind of how sometimes you connect the dots looking backward: I ended up actually running the mobile networking team at Uber for a good while.
The way I got into that role was I was interviewing a bunch of people to solve making the Uber app very fast on low-connectivity networks, in emerging markets and things like that.
And it was kind of similar to the problems I was working on in grad school, but we needed to make TCP faster, for faster connectivity on the go. So it was kind of funny; I consider myself very lucky to have worked in the industry, in some sense, on these two problems that I went to school for.
David McKenny: Make TCP faster, like UDP, but no, you can't use UDP, you just need to make TCP faster. That's great, because I was actually going to mention that it seems like you got the best of both worlds: the mobile side with Uber and the data side with everything that led up to it.
So you had grad school, you had LinkedIn, you had Uber. I don't know if it's one of those three here, but what would you say was probably the right place, right time for you? What was that moment where you were like, man, I'm glad I'm here, this is kind of what got me where I'm going?
Vinoth Chandar: Yeah, definitely Uber and all the events that led up to us building Hudi, I would say. Because my two jobs right before that were on database CDC and building a key-value store at scale. So I had basically looked at database problems, and at LinkedIn we had also seen, as you know, the rise of Apache Kafka and stream processing.
And I had a sort of front-row seat into that. So when I landed at Uber, I felt like, okay, I had worked on these problems. Uber is a very real-time business, and we were caught between a data warehouse and a traditional data lake.
I felt like I was in the right place, with the right previous four years of experience, to see the problem in that way, which is: okay, can we make all the data processing on the data lake less batchy and more incremental, and then work backwards from there? What do we really need to do that, and to unify this whole data warehouse, data lake divide?
Right. So that, I feel, wouldn't have happened without the roles or the projects leading up to that. And for that, I consider myself lucky, I would say.
(7:55) Data Science, Databases, and Database Architecture
David McKenny: Kind of getting to see the problem statement materialize in front of you. So we'll definitely get into Hudi in a minute here.
I think, given your background in data science in general, it'd be great to get your perspective on the industry itself. I know the term has been around for a while, but it seems to really have come into its own recently, definitely separating itself from what a data analyst might have called themselves.
Yeah, I recently happened to read a book that highlighted DJ Patil and what he was doing as a data scientist under the Obama administration. He had a pretty concise goal that he put out there: that data science, and his role, was all about unleashing, in a responsible fashion, the power of data to benefit people.
You know, what does data science mean to you? And how are its goals different than maybe other related fields that we've long been used to, like databases and database architects and things like that?
Vinoth Chandar: Got it. I think, and again, DJ Patil was at LinkedIn as well, if I'm not wrong, when we coined the term data scientist.
And I think, yeah, my understanding is very similar. Databases, warehouses, or lakes are about the actual data infrastructure, the thing that serves and stores the data, right? But data science is all about, I think, unearthing patterns from data, to either feed product feedback,
hey, you should be building new products in a certain way, this is where your existing customer base is, things like that. Or there are opportunities to feed into machine learning or other things, where machine-to-machine you could be optimizing a lot of these experiences end to end, right?
And the great thing about data science, I think, is that it needs a lot of other supporting functions. You need data engineering to be able to move data to systems that let you analyze it, right? Because a lot of very large-scale data exploration is needed to unearth these patterns.
So you need data engineers. You need data visualization, powerful ways for you to look at the data and even spot these patterns. So I think it's a very important, but very multi-functional, kind of area, in my opinion. At Uber, for example, data science grew hand in hand with data engineering and with visualization on the other side, because we needed to visualize geospatially.
A lot of companies before that didn't have that big a need to visualize data on areas, hexagon by hexagon, seeing where the traffic patterns are, where the cars are going in and out of, and how the surge is spread across the city, things like that.
So if you look at it, it kind of cuts across all these different disciplines, I think.
David McKenny: Yeah, it's interesting with LinkedIn. I'd go out on a limb, and I'm happy to be wrong here, but I would say that most people on the surface don't consider LinkedIn a data company. But when you really think about it, the profiles it's known for really became a modern data type of sorts.
So when you talk about the key-value work that you did there and sort of setting that stage, do you feel like it gave new value to what data scientists and engineers could do? Because I think with data analysts now, there's a bit of a misnomer with that title, in that pretty much anybody assumes an analyst is working with data.
So it seems almost like a duplicated meaning here, data and analyst. But you've got data science, and it seems to really have taken LinkedIn forward, to the point that I'd say a lot of folks use it for hiring practices and it really disrupted the market. But on the back end here, there's a lot of data work that's going on as far as analytics.
And I don't know how much exposure you had to it in that realm, but is there anything you could share with us on some of those early days of what LinkedIn was doing as it was going through its hyper growth?
Vinoth Chandar: So LinkedIn, on the business side, sells recruiting solutions, essentially a B2B business, right?
So we had data analysts on a data warehouse doing generally what they do, but things really changed, I feel, when LinkedIn started building People You May Know, jobs you may be interested in, those kinds of recommendations, because it's a big, large-scale matching problem, right? You have millions of profiles that you want to connect to millions of other profiles.
You have millions of jobs, and you want to match people to them. So I feel at that point, a lot of the traditional RDBMSs or data warehouses simply couldn't do things like that. And the kind of algorithms that you needed to run, for example for People You May Know, you need to do a graph search where you are comparing your connections, your connections' connections, and run that kind of analysis.
Right. So you need completely new infrastructure; that is where Hadoop, data lakes, and all of these things came into prominence, because you need to crunch a large amount of data. Then you have what I call data applications, because essentially you're computing results which you're then loading into serving stores and serving back to the site.
And we were right at that point, where we were the key-value store that was serving Who's Viewed Your Profile and anything that rhymes with that on the LinkedIn side; something that we built out of crunching data and recommending, served out of the key-value store that we were building.
And that's where I feel the infrastructure, the kind of engineers that worked on it, the whole field, bifurcated, right? Because of how sophisticated these ML algorithms are compared to, let's say, what you would normally run for business dashboarding or BI.
So that's kind of how I see it, and it's kind of stayed the same way from that point on. Like you said, a lot of people don't realize it, but LinkedIn was actually ahead of most social networks of that time in this game, when they were doing all these different products, and it set a lot of the data science in the industry in motion,
if we go back and look at the evolution of the thing.
(15:13) Uber and Apache Hudi Beginnings
David McKenny: That's fantastic, because you're right, without truly looking, I don't know that I would have drawn that conclusion either. But it makes me think, as you're talking here, that while a profile doesn't change much, right, it gets updated here and there, there's quite a bit of analytics that is continually changing on the back end.
And it's got to be very different as you come to a company like Uber, where, while the profile spans the term of a professional career for a person, at Uber you're looking more at a trip, right, as a unit of measure here. So how did this key-value and analytics backend morph into what you started as the Apache Hudi project at Uber?
I think I'm starting to see what the problem statement was as you're talking, and it seems like there's some of the same within LinkedIn, but there's clearly a new situation that was presented with Uber. I'd love to hear how you got to the building of Hudi versus trying to find a solution that was out there.
Vinoth Chandar: Yeah, so that's a great point. Profile data does not change at that kind of rapid rate, but trips do change, and we didn't realize that at first. Uber had a lot of people coming in from Facebook and LinkedIn and the usual kind of companies that you'd expect in the Bay Area, right,
to come and build, and we had all built very large data lakes before that. But the thing that stumped us was the transactional nature of the data and the fact that it's changing. For example, a trip lasts, say, 20 minutes, right? The trip starts, the rider and the driver see a price upfront, but the price may change after the fact of the trip.
The route you take changes; there are many things that change. Maybe you change the payment method mid-trip. So we really had this need, and this was the core data of the company. As you can see, this feeds everything else, right? So we needed to merge these updates into the downstream systems very quickly.
At LinkedIn, this kind of data could be a little bit staler, right? We get all the events and the page views in more or less real time, but for the actual transactional data you can do batch copies and things like that.
At Uber, we simply couldn't afford to do that. And that is what stumped a lot of us who had come from running Twitter, Facebook, or LinkedIn kinds of data infrastructure at that point. We actually pondered over this for a while before finally deciding to build a system around it,
because this was a pattern that had not been broken in the 10 years before that. So there was a lot of debate about, hey, we didn't need updates or these things for 10 years in Hadoop land, why do you need this now? And we had to walk through a lot of those questions. But for me, the other angle is:
Uber is truly a real-time business. I know it sounds cliched, but think about it, right? The traffic changes, there's a big event, it rains; things are changing in the real world, which means somebody who wants to take public transport is now taking an Uber.
And then that changes demand, which affects supply, and then the pricing changes; everything changes. So for us, from the start, we were like, okay, we need all the data in near real time, as fresh as possible, in our data lake, because just as a first principle, we thought it would help us, right?
If anything, as the real world changes, we can respond very quickly. We also looked at it from this angle: I saw firsthand what stream processing did for a company like LinkedIn, and how the power of data streams was really amazing and transformative, right?
But can we now do this at the scale of the data lake? Because if we moved all of our data to a stream processing system, we just could not afford that; it's going to be super expensive. We can't run everything in real, real time. We needed the columnar file formats, we needed the horizontal scalability of the data lake,
we needed the compute scalability of the data lake. But can we build a system which provides the capabilities of the warehouses, the transaction capabilities, updates, and all those things, and also helps us change some of our expensive ETL jobs and data transformations into a more incremental model, right? And I think, five years on, what the team at Uber has achieved is they've moved all of their core warehouse processing into a more incremental model, and they've gotten very large gains out of this. So I would say it's a combination of
a bunch of requirements around Uber's business that got us to build some of these things. And the one thing I probably didn't touch upon is the regulation in the business; it's a highly regulated business. So we had to be able to delete data long before GDPR was a thing in the industry, right?
So we built Hudi way ahead of its time, and for two years it was this nerdy little thing that the Uber engineers built. But then when GDPR happened, everybody started to see the light, which is: oh, I can't treat my data lake as a dump of files. Format standardization and all these things are great, but you need an efficient way to delete your data and manage your data on the data lake.
So those were, I think, the two or three industry trends and the business requirements at Uber that kind of forced this category into being.
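To make the upsert and delete pattern described above concrete, here is a minimal PySpark sketch. It assumes a Spark session launched with the Apache Hudi bundle on the classpath; the bucket path, table name, and fields (trip_id, city, ts, fare) are illustrative placeholders, not Uber's actual schema.

```python
from pyspark.sql import SparkSession, Row

# Assumes Spark was started with the Hudi bundle, e.g.:
#   spark-submit --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version> ...
spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

base_path = "s3://my-bucket/lake/trips"  # hypothetical table location
hudi_opts = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",   # key to merge on
    "hoodie.datasource.write.precombine.field": "ts",       # latest ts wins for duplicate keys
    "hoodie.datasource.write.partitionpath.field": "city",
}

# Upserts: changed trips simply replace older versions of the same key.
updates = spark.createDataFrame([
    Row(trip_id="t1", city="sf", ts=1700000000, fare=12.50),
    Row(trip_id="t2", city="sf", ts=1700000100, fare=30.00),
])
(updates.write.format("hudi")
    .options(**hudi_opts)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(base_path))

# GDPR-style delete: pass the keys to remove, using the delete operation.
to_delete = spark.createDataFrame([Row(trip_id="t2", city="sf", ts=1700000200, fare=0.0)])
(to_delete.write.format("hudi")
    .options(**hudi_opts)
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save(base_path))

# A snapshot read reflects the merged state: t1 present, t2 gone.
spark.read.format("hudi").load(base_path).show()
```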
David McKenny: Yeah, that's really interesting. So it's like you need the power of what a data warehouse provides in reporting, but you can't wait for the normalization of unstructured data. You've got new data constantly coming in, and you need to report on it in real time. So you've got a use case here that demands that, but it sounds like there were really not a lot of tool sets or options out there to go buy one. So you had to build one, if I have that right.
Vinoth Chandar: Correct. And what we've seen is that most people who have transactional data
have been able to apply the same pattern: basically, take Amazon.com, Walmart, Robinhood, anybody who has this kind of transactional data, essentially very high-scale data, right? Millions of transactions, millions of orders and stuff like that, which are all changing. They've built some very large data lakes out there using Hudi, which powers all of these in near real time.
(22:53) Unstructured Data, Data Lakes and Data Warehouses
David McKenny: Okay. So as you're talking, the thing that came to mind here was that data lakes and data warehouses have really been separate things, and we know that some of these are coming together in the form of a lakehouse these days, under a new term. But what you're describing is the need for data warehouse-style reporting, in near real time, on what sounds like unstructured data, right?
You don't have time to wait for the ETL or ELT process to hit a data warehouse. And it strikes me that anybody who's working with a data warehouse in the traditional sense would see this and say that this is the next coming of what a data warehouse does. Maybe that's an overstatement here, but data warehouses in my mind were always a thing that you brought data to.
I build my data warehouse, I bring the data to it, and it does the data warehouse things. It's almost like you're describing a business case, a data problem, where I want to bring all of this data warehousing and analytics to the data instead. Is that a good way to think about this as we see the growth of unstructured data, that we're bringing the power to the data versus the traditional reverse?
Vinoth Chandar: Yeah. So, actually, this is a very interesting thing to understand. The data lakehouse is actually all about structured data so far. That should kind of make us sit up and think, right? Because there are two angles to this: the data lake versus the data warehouse,
and then the other thing is structured versus unstructured data, right? And those are actually slightly separate problems. So let's start with the data warehouse. What's a data warehouse? It's been out there for decades. It's a specialized database, a specialized relational database essentially, where you move data from your online database or OLTP system.
It's optimized with columnar storage, optimized for crunching the data and reporting, rolling up, counting stuff, basically, right? The analytics. And data lakes, like I mentioned before, mostly evolved to process large amounts of files, to put it in a very crude way, right?
And that had been traditionally structured data, or semi-structured data like CSV files, JSON, something like that, or completely unstructured data, like a bunch of PDFs thrown in there, where all the PDFs are different and there's no structure to any of them.
How do you process that? You could do all of that on the data lake, right? The data lakehouse effort so far in the industry is around how we deal with even the structured data more effectively, right, across different use cases. For example, the LinkedIn-like use cases that we talked about, none of them are unstructured data, right?
Because all the events that we are emitting around, let's say, page views or your profile-view events, anything that we needed to build these data apps, as I call them, had schemas. They had some kind of nested structure, and we could store them in open file formats and then run jobs on top of them.
Right. What has changed in the last five years is there are open columnar file formats, like Apache Parquet and Apache ORC, which are broadly adopted. And now all this is about is: can we bring these warehouse capabilities on top of those open file formats, and make data science, machine learning, and analytics all sit on top of one single copy of the data?
That is pretty much what the lakehouse vision has so far achieved in the industry, right? This whole journey of bringing unstructured data into the lakehouse is just starting, and we are actually working towards a bunch of this in Hudi now, where we are expanding Hudi to include more.
There's a Hudi 1.0 that our team is currently working on, where we try to include images, videos, or LLM (large language model) vector indexes, all these other unstructured data types, so we can bring them into the data lakehouse paradigm as well, if you will. Ultimately, it's about: can we decouple the storage, the data management on the storage, from the different compute engines?
That's basically it. I reckon data warehouses are going to be really great at BI, analytics, and the kind of traditional BI reporting that every company basically needs, for a few more years, right? And then there are specialized engines for data science and ML, there are more evolving around AI, and the stream processing engines are evolving.
Can they all sit on top of a single copy of the data? That's the thing that probably all of us are working towards, I think, in the industry.
(28:26) Data Lakehouses
David McKenny: So let's talk more about the data structure side. You mentioned structured data, unstructured, semi-structured. And in the lakehouse, we're talking about addressing that semi-structured format, it sounds like, taking an approach that's not so rigid in the row and column space.
Where does preserving that model for unstructured data start and stop, as you mentioned, trying to work from one copy? Are you saying that one of the big challenges right now is trying to work from that unstructured data set and preserve it? Or does that eventually have to find its way to semi-structured and structured data along the way?
Vinoth Chandar: Great question. So yeah, what I see is that most companies, let's say they have a JSON database upstream, need to attach some kind of structure to get the data into a form, into a lakehouse or a warehouse, to be able to do meaningful things with it and process it more reliably.
What really happens with unstructured data, when you query it as-is, is that the data can change in very incompatible ways, and then your analysis can keep breaking if we don't bring it into some kind of structured form. And also, even in structured data right now, you have nested data, right?
A lot of these file formats that I talked about can support nested structure. So you can take JSON and turn it into a Parquet file if you want, right, with the same kind of structure as the JSON. So it's more about making sure the data is evolving in compatible ways, so that the business processes don't keep breaking whenever you change something in an upstream database.
I think that's the main issue we see people dealing with around this, right? Then there are companies who have a lot of image data; think about healthcare companies, for example.
The problem of data management there is less about this. It's more about: I have images and I have surrounding metadata; how do I get this in, and store and manage it in a way that I can change the metadata and the image in a consistent way? For example, with some patient data, you may want to attach some extra metadata to an X-ray, and maybe you also replace the X-ray, right?
You have a newer X-ray, and maybe the old one as well. How do we replace the image and the metadata in consistent ways? So that is the bigger problem around unstructured data management. And like I said, the lakehouse technologies today have addressed only the structured part of it.
If you take a look at Hudi, we have easy utilities for you to take CSV, JSON, XML, all kinds of semi-structured and structured data, schematize them, and make well-optimized tables on the other side, which you are able to bring in front of warehouses, lake engines, and data processing engines like Spark, all these different things.
But unstructured data management is an active area I think we need to go tackle next. It's very interesting, and it's evolving, because there are so many data types, right? There are some 10 different image types, some five or six different audio and video types.
It's an evolving space, a lot more diverse and complex to wrangle, I would say.
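As a rough illustration of the "schematize it and make well-optimized tables" flow mentioned above, the sketch below reads semi-structured JSON, lets Spark infer a (possibly nested) schema, and writes it out as a Hudi table that engines wired to the same storage could query. The file paths, record key, and partition field are assumptions for the example, not a prescribed layout.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schematize-json-sketch").getOrCreate()

# Semi-structured input: newline-delimited JSON, possibly with nested fields.
raw = spark.read.json("s3://my-bucket/raw/patients/*.json")  # hypothetical path
raw.printSchema()  # Spark infers a nested struct schema from the JSON

# Write it as a Hudi table: columnar Parquet files under the hood, plus
# transactional metadata, so downstream engines see a consistent, optimized table.
(raw.write.format("hudi")
    .option("hoodie.table.name", "patients")
    .option("hoodie.datasource.write.recordkey.field", "patient_id")   # illustrative key
    .option("hoodie.datasource.write.precombine.field", "updated_at")  # illustrative field
    .option("hoodie.datasource.write.partitionpath.field", "region")   # illustrative field
    .mode("append")
    .save("s3://my-bucket/lake/patients"))

# Any engine reading the table (Spark here) gets the schematized view.
spark.read.format("hudi").load("s3://my-bucket/lake/patients") \
    .createOrReplaceTempView("patients")
spark.sql("SELECT count(*) FROM patients").show()
```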
David McKenny: It makes me draw a comparison here to what's going on in the generative AI space and machine learning, and how we are labeling data there. So when you're talking about needing to manage this metadata around the data itself, it seems like a very similar problem statement: how we organize the metadata around the data itself and manage that.
When it comes to the transactional nature, the idea that we bring transactional database things to a data lake, what were some of the core pillars of OLTP-type workloads that you needed to make sure were tried and true with Hudi, no exceptions? Were they rooted in performance, were they rooted in a single copy of the data?
You've talked about quite a few things here, but what are some things that are core to Hudi and how it operates, that are non-starters to break?
Vinoth Chandar: So, that's a good question, right? The first thing is, whenever we think about transactional data, I think we unconsciously relate it to relational databases and OLTP systems.
And our view is that that is not the right approach for the data lake or lakehouse kind of model, because these are high-throughput systems, right? You're doing large-scale analysis; even your analytics queries are crunching through a lot of data and giving you counts, right?
So this is not the same problem as having a database behind your app that is taking orders and powering these online, real-time kinds of applications. That was the first distinction we made. And that is where, actually, when we look at other projects in this space, they build all this stuff like serializability.
Relational databases talk a lot about serializability. Serializability in these kinds of high-throughput systems with multiple writers is very, very bad, because you have large processes which are going to fail and lock on each other and waste a whole bunch of compute. So the way we thought about this was: how can we borrow more from stream processing, in terms of how they've dealt with high-throughput event streams,
and how we can commit data atomically into a table and expose ways to keep managing the table in the background, while we allow multiple writers to keep writing data continuously to the table? For concurrency, we try different concurrency controls, like MVCC models and non-blocking concurrency controls that we borrow from the database literature, which are slightly different from traditional relational databases. And for me, having worked on a relational database and on key-value stores, and having built a real-time data store for a bit of time, I understand these subtle differences in concurrency control and how they matter.
But the way we approached it was: there are going to be jobs writing data at high throughput. How do you make these jobs commit data atomically while allowing for larger concurrency, even if it sacrifices strict serializability, right? Those are some of the ways we thought about it. We are still building, let's say, multi-table transactions,
for example; they could have been built before, but the use cases just aren't there in the data lake. What I described works for almost 99 percent of the use cases out there, which is a single job, computing or running an ETL, committing to a table,
and then a whole bunch of background processes backfilling data or doing some maintenance on the tables. That's how we approached it.
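For the multi-writer case touched on above, Hudi exposes concurrency settings on the writer. Here is a hedged sketch of what that configuration can look like; exact option names and behavior vary by Hudi version, and the lock provider and ZooKeeper endpoint below are placeholders to adapt.

```python
# Writer-side options for allowing several concurrent jobs to commit to one table.
# Treat this as a sketch; consult the Hudi docs for your version before relying on it.
concurrency_opts = {
    # Default is a single writer; this enables optimistic concurrency control (OCC),
    # where conflicting writers are detected at commit time instead of row-level locking.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # With multiple writers, failed writes are cleaned up lazily rather than eagerly.
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    # OCC needs an external lock provider; ZooKeeper is one common choice.
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",            # placeholder endpoint
    "hoodie.write.lock.zookeeper.port": "2181",              # placeholder port
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",  # placeholder path
}

# These would be merged into the same writer options used for upserts, e.g.:
#   df.write.format("hudi").options(**hudi_opts).options(**concurrency_opts)...
```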
David McKenny: Great, lots of updates. So if I understand that correctly, you don't want to mess with the writes; you want as many writes coming in uninhibited and non-blocking, as you said. And does that mean that for the read operation, because this is real time, you want to be able to read from that same data that you're writing, but compared to the writes, the read might be from a copy that isn't fully up to date, as you're just reading from whatever is most recently available on the copy of data?
I say copy of data, I should say the data, because we're working from a single set of data. But it sounds like the writes are far more important than the reads, and the reads just benefit from the real-time nature of the platform.
Vinoth Chandar: So for reads, you basically provide snapshot isolation, where you say you're reading the latest committed state of the table, right?
And that state may be changing over time. And think about it: these reads can be very long, let's say a large machine learning training pipeline, or these queries can be very short, like a dashboard that is just trying to look at some numbers, right? So you also need to retain enough snapshots over time,
because as the table is changing, if you go and delete the previous snapshot, those jobs are going to fail after running for seven hours, right? And you lose a lot of compute; all the compute power you spent is now lost midway. So for reads, you have snapshot isolation, and you need a powerful enough metadata system that can retain enough snapshots and maintain that metadata, again, more incrementally.
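Snapshot isolation and snapshot retention show up directly in Hudi's read and table-service options. A hedged PySpark sketch follows; the commit-instant value, retention numbers, and option names reflect recent Hudi releases and are placeholders rather than recommended settings.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-snapshot-sketch").getOrCreate()
base_path = "s3://my-bucket/lake/trips"  # hypothetical table from the earlier sketch

# Default read = snapshot query: the latest committed state of the table,
# unaffected by writers that commit while this (possibly long) job runs.
latest = spark.read.format("hudi").load(base_path)

# Time travel: pin a query to an older committed snapshot, as long as the
# cleaner still retains it.
pinned = (spark.read.format("hudi")
          .option("as.of.instant", "20240101093000")  # placeholder commit instant
          .load(base_path))

# Writer/table-service settings controlling how many snapshots stay readable,
# so a seven-hour training job isn't pulled out from under itself mid-read.
retention_opts = {
    "hoodie.cleaner.commits.retained": "24",  # keep this many recent commits readable
    "hoodie.keep.min.commits": "30",          # archival bounds for timeline metadata
    "hoodie.keep.max.commits": "40",
}
```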
David McKenny: The whole merging process of the data.
Vinoth Chandar: Right. Those are some very specific design considerations for, again, designing a transaction layer for these workloads, without blindly applying the relational database concepts that we are all familiar with from growing up in computer science.
So, for example, NoSQL had to sacrifice a little, some of the transactional features, even multi-row transactions, to get availability and other things going. Database systems are always a trade-off for different workloads, and for us, we need to make the right trade-offs for the data lake workloads.
That's kind of the main point.
(39:40) Apache Hudi Helps Businesses with Data Warehouses and Data Lakes
David McKenny: Definitely. And you're right, there are trade-offs, CAP theorem and things like that, and ACID. You talked about 99 percent of the use cases here; maybe go into this a little bit more. Where does somebody need to be in their data warehouse, data lake journey to take advantage of what Hudi can do?
Is it something that everybody can benefit from today, or is there a certain stage you need to be in with your data to truly take advantage of it? And maybe that's the other part here: your scenario. I think we've talked a lot about scenarios just as they come up, but given your scenario and, I guess, your position with data today, how do you find the fit for Hudi?
Vinoth Chandar: Yeah. So I think if you have a data warehouse today and you're thinking about adding more use cases: if you have a data warehouse today, you're probably doing traditional BI kinds of use cases in your company, right? The data warehouse probably works well for that, as long as you have low-throughput data or low-scale data, if you will.
So typically what I see, based on the Hudi community, is that people outgrow their data warehouses, for example when they move from a small relational database to a NoSQL data store; then suddenly the warehouse can't keep up with those things and becomes very, very expensive for the ingestion and the initial data prep kinds of stages, right?
So that's a good point to consider: warehouse cost. And if you're looking into non-BI use cases like data science and machine learning, you absolutely need this kind of data lakehouse-based architecture where you can bring multiple engines, right? Because there's something like Apache Spark, and there are a lot of Python libraries around data science, which are really great.
You get a lot of built-in algorithms around data science right out of the gate from them, right? You don't want to be hand-building all of these on a warehouse with UDFs and things like that. And if you have a data lake today, and you have large batch jobs which are running up your cloud costs,
those are, again, a good reason for you to consider something like Hudi, which at this point is pretty industry-proven for these kinds of incremental workloads, right? So instead of, let's say, running a batch job every eight hours and building some table on top of the data lake, you could be running that job every few minutes, right?
Every 10 minutes, every 15 minutes, and Hudi is able to do that, build the same tables incrementally, because of all the indexing and all the different things that we talked about around the transactional layer, right? So those are the two different scenarios; for existing data warehouse and data lake users,
I would recommend looking seriously into something like that.
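The incremental model described above, rebuilding a derived table every few minutes from only what changed, maps to Hudi's incremental query type. A rough sketch follows; the option names follow recent Hudi releases, and the checkpoint handling and field names here are simplified, hypothetical stand-ins for a real pipeline.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()
base_path = "s3://my-bucket/lake/trips"  # hypothetical upstream Hudi table

# In a real pipeline this checkpoint would be persisted between runs;
# here it is just a placeholder commit instant.
last_processed_instant = "20240101093000"

# Incremental query: only records that changed after the given commit instant,
# instead of rescanning the whole table every run.
changes = (spark.read.format("hudi")
           .option("hoodie.datasource.query.type", "incremental")
           .option("hoodie.datasource.read.begin.instanttime", last_processed_instant)
           .load(base_path))

# Downstream logic would merge just these changed records into derived tables;
# as a stand-in, aggregate the changed records (e.g. fares per city) and show them.
per_city = changes.groupBy("city").sum("fare")
per_city.show()
```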
(42:50) How Public Cloud Impacts Data Lakes and Data Warehouses
David McKenny: Good. So you mentioned cloud there; I'm going to take it to the public cloud, actually. How do you see public cloud services impacting this space? Object storage, certainly with Amazon S3, has been a pretty impactful thing in our industry,
and it certainly stores a lot of data. But are there other things that public cloud and the services they're bringing to market are doing to impact the space, either to the positive, or maybe creating new trends or problems needing to be solved? All in the light of good things here, but what is the general effect of public cloud services in driving data usage up, and maybe what does it mean for the future of Hudi as well?
You've talked about how it's evolved to date, but maybe where it's going as well.
Vinoth Chandar: Yeah. So for Hudi, we have worked very closely with Amazon since 2019, because, again, going back to Amazon.com itself using Hudi for some very critical kinds of use cases.
And the way I see this fitting into cloud service providers is similar to what we've done with AWS since 2019.
CSPs in general usually have multiple analytical services that you use for different use cases, like I've been emphasizing throughout this entire podcast, right? If you look at AWS, there's Athena for analytics, there's Redshift, there's EMR to do Spark and data science, and EMR also now supports Flink for stream processing.
So they usually have a portfolio of query engines, if you will. And something like Hudi can really serve as that central, more universal data layer that goes across all these different engines, and they're naturally aligned with that, because it's in the customer's benefit to have that one single copy of data.
And Amazon has done a really good job, I would say, integrating Hudi into things like AWS Glue, even. So you can easily move data into Hudi tables if you want to, in a bunch of different places, right? And we are similarly pre-installed for other cloud providers, including Google Cloud.
Hudi was out there, installed in cloud providers, long before we even started Onehouse around it, building on the idea. But yeah, it's already there, it's already happening; it's already more of a mainstream trend that people are adopting a technology like that for these purposes.
(45:52) Origin of Hudi Name
David McKenny: So you mentioned Onehouse there, and I wanted to ask this question earlier on, actually. With Hudi, I'm kind of a guy who likes to know the origins of a name; where did the name Hudi originate, as far as the project goes?
Vinoth Chandar: Yeah, so there's an interesting story behind it.
Hudi stands for Hadoop Upserts, Deletes and Incrementals. That's the core capability that we were adding to the data lake, and that's how it started. Although internally, at Uber back then, we had this thing of giving everything a cool code name, if you will.
So initially the project was called Hoodie, like the clothing, right? And that's how we actually open-sourced the project. But then there was some name clash with another product, so we went back to the acronym, basically. But we were already in production with Hoodie by that time.
So you still see a lot of our code named after the hoodie, the clothing, while the project's official name has remained Hudi for a while now. Just an interesting kind of twist once you open-source a project.
David McKenny: I love it. It works pretty well with the clothing line, right?
Give Amazon a run for their money on their re:Invent jackets; you need to get the Hudi hoodie going for folks. Should be a pretty popular thing. Yeah. Let's talk about Onehouse. So you've got Onehouse going now, which is around Hudi services. What's the focus there? Hudi as a technology is certainly tried and true.
If it works for the likes of Uber, I don't know that anybody's going to question whether or not the solution works. But where do you find your time spent at Onehouse, and why did you see a need to get it started?
Vinoth Chandar: Perfect. So the thing is, we didn't actually conceive Onehouse as a managed Hudi company.
Onehouse was born out of this pattern that I saw in the open-source Hudi community for four years, right? We built a community from the ground up; we were building a grassroots kind of open-source project. And it was mostly just because it was fun; it was purely weekends and nights, outside of my day jobs at all the other places.
But the pattern I noticed was that people typically have these existing data lakes or data warehouses, and they come to the community to start building this architecture that I'm talking about, where I'm basically advocating: let's move all your event data, your transactional data, and any data that you have into one kind of open
data layer first, and you should have the optionality to go attach many different engines for different use cases, because having been through that ride at Uber and LinkedIn, we know that inevitably you're going to get to that point, right? A lot of people reach that point and join the Hudi community, pick up the technology, to help them build that architecture.
But what I observed was it takes them eight to nine months, or up to a year, because there are four or five different open-source technologies engineers need to pick up and learn. You need to know at least one of Spark or Flink really well. You need to know how to run Debezium well for change capture.
You need to know how to manage Kafka well. You need to understand Postgres, let's say, and its change capture, and all these different things. You need to understand how S3 scales, and you need to understand how catalogs work and how these different engines work. So I felt this was a very insurmountable bar for data engineers and platform teams, to invest
like a year into getting this architecture. So why not standardize this into more of what we now call a universal data architecture, where we say: let's have your source data, and the first two layers of your data, stored in purely open formats, and we have built a managed service to fast-track you there.
So with Onehouse, you point and click, and you can bring all kinds of data into your data lakehouse. And we've built broad interoperability across even other competing lakehouse storage formats and other projects, so right now we have complete universal interoperability; it can connect to any engine, any warehouse out there.
And then you have the freedom, right? You can pick whichever engine you want to buy, because the same company would be using, for example, an open-source engine in one team, while maybe the data science team gets the more premium compute, backed by a vendor, because they need their jobs to be faster.
And then you want to do some ad hoc analysis, and you don't want to use a different engine for that. So we see how the buying process for these query engines is actually very different across use cases and price-performance needs. We want to enable that choice right across, and Hudi is the foundational technology, the storage layer, that makes all of this happen, because without Hudi we wouldn't be able to ingest, to bring this data very fast into a data lakehouse and incrementally transform it, cut down costs.
So essentially it gives us all this infrastructure goodness that, like you said, most of these other big companies have reaped from Hudi. But what we are really building at Onehouse is, we're trying to make this the starting point for anybody starting on cloud data, instead of picking a proprietary warehouse
and migrating two years down the line into a data lake, a journey that again takes a year, right? So we just thought this is a better way of building this architecture in the cloud.
David McKenny: The open source landscape is pretty daunting when you look at the options and all the things that are being done there.
It actually makes me wonder, because I don't talk to too many Apache projects: what was it like running the Apache Hudi project, for those who aren't really close to how those things function and work? You've run it for some time now; I'd love to hear just some insider info about what it's like to run an Apache project.
Vinoth Chandar: Yeah, I mean, that I think has been a really big journey for me. And interestingly, before this, I had actually not been a committer or on a project management committee, so I didn't have very first-hand experience with a PMC in an Apache project. I had run open source for a while before that, but not specifically Apache.
So I think it's been a great experience, actually, because Apache puts community over code. And for a project like Hudi, where we want the storage layer to stay neutral in the industry, it gives us a really good forum, right? Because we have a project management committee with members from four different cloud providers, a bunch of consumer internet companies, and then there's us.
It's a lot of effort to build a community, but it's actually very rewarding to see when people take the open-source project and go build cool things, and we can see the impact of it. But it is definitely a lot of investment.
I actually don't have a lot of drama or interesting nuggets to share, because it's actually been super pleasant. We have a very healthy community that is very friendly; people help, the community helps each other. That's the greatest thing for me to see. Okay, sure, I'm incentivized to help other people because I started the project, and the bunch of people who work on the project help with that.
But when you actually see users helping users, with tips on how to do this or that, those are really good moments, I think, for a community-based thing. For us as a company, a lot of people ask me this, right, but we're still supporting Hudi, the Hudi community.
We have dedicated resources for helping people. If people still think they want to build all this by themselves, with their bare hands, they should be able to do that. But if you want a faster track, we are around. That's how we look at the open source versus managed service question. So it's been great, but it is a lot of effort; that's what I would say to anybody who wants to do that. It takes a lot of perseverance, through years, to build a community. It's not easy.
(55:04) Outro – Getting Started in Data Warehousing and Data Lakes
David McKenny: I think that's great feedback. I love that the community kind of sets forth a perpetual machine, that it becomes self-fulfilling. And yeah, it totally makes sense: either do it yourself or bring in services to help get you started on your journey faster.
So we've got the Apache project out there; I'm sure people can go find more information from the Apache Hudi site and the community that you've referenced. But taking a slight step back, if someone is relatively new to data lakes, and even this transactional component here, where would you send somebody if they're wanting to learn more or get started in this area of data warehousing and data lakes?
Are there some good resources that you would recommend, even your own?
Vinoth Chandar: Yeah, I think we have plenty of quick starts and things like that. Typically, somebody starting this is either, let's say, a data science kind of person or a data analyst who's crossing over from the warehouse to the lake,
or a backend engineer who's now trying to understand data. So for all of those people, I think a good central point for us is that we have very easy tools that let you move your event data, or any file sitting in the cloud, and quickly build a table,
and then integrate it with a bunch of different catalogs. Then you can start querying and start understanding these tools, right? So we are, in some sense, a good entry-point project for anybody trying to build a data lake. But beyond that, the one piece of advice I would have: it's a very diverse space with a lot of different tools,
so try to pick one or two tools and understand them deeply before you go for breadth, because once you do that, I think a lot of the other tools you can learn by just comparing and contrasting with the things that you know already. So I would start with something like Apache Spark or Apache Flink, one of these data processing frameworks.
I would try to understand some of these query engines, Presto, Trino, or even BigQuery, which has really good external tables right now. Pick one of these query engines and try to write programs both in code as well as in SQL, and try to get a sense of the price-performance and how the thing really behaves. Especially when you're coming from the data warehouse space, I think you will see that very short queries can be a little bit slower on the lake,
and you try to understand deeply enough why they're different, right? When you come from an OLTP space, you will understand that they're all going to scan data. Hudi has indexes, but most of the lakehouse technologies don't have indexes. So if you don't have an index, then if you write a normal, simple database query, it's going to scan the data, right?
So you'll see all these subtle differences that stump people, and I've seen them a lot of times in the community as well. So I would encourage people to pick two or three things, pick a storage layer, pick a compute or query engine, and learn them; spend a few weeks on that before you kind of go shopping for breadth.
David McKenny: Yeah, that's really good advice. As somebody who's got a lot of projects started around the house that are 80 percent done, I can definitely attest to the value of getting a few of them done first before going on to the next. I think we all succumb to the click-happy nature of reading things, studying things, and looking at what this means, and before you know it, you're on a whole new technology set. But yeah, context switching in the learning process is definitely difficult.
And you're right, there are so many tools out there, so many solutions and platforms, it can be overwhelming. So I think that about wraps it up for this chat, this fantastic conversation; I know I learned a lot. For those who don't know, make your way over to the Apache Hudi project site, as well as Onehouse, and see all the great things that Vinoth is working on or has worked on.
You've given me a lot of stuff to think about. I have no doubt that for the foreseeable future, every time I'm looking at my phone for that next Uber ride that's arriving, I'm going to be thinking about the transactional upserts, inserts, and deletes that are happening to a data lake on the back end, and I'm going to say,
I know how this works, kind of, sort of, and maybe I'll try to explain it to the person next to me. So, with that, we'll catch everybody on the next Cloud Currents discussion. Thanks, and I look forward to seeing you again soon.
Vinoth Chandar: Alright, thanks for the fascinating conversation and the very interesting questions. I look forward to connecting again soon.