October 15, 2020
It is well known that machine learning is powered by data. Unfortunately, the raw data we would like to use to train models is often created and stored in formats that are not machine-consumable. As part of our Determined Podcast Series, Craig and I recently had a conversation with Joe Hellerstein, a computer science professor at UC Berkeley, a leading researcher in the databases community, and co-founder of the data wrangling company Trifacta. Joe talked about the unique challenges of data preparation, and how it blends ideas from software engineering and media editing. Here are a few highlights of our conversation; you can listen to the full episode below or stream it on Spotify, Overcast, or Apple Podcasts.
Read the full transcript here.
In any reasonably mature environment, you have a million formats of data and each one of them made sense to somebody. But by the time you get working on it, you’ve got a hodgepodge… At the end of the day, anybody who’s working with data has to deal with the messiness that’s in that data. They have to make sure that things line up so that they can be analyzed. They have to make sure that things are coded uniformly. They have to make sure that outliers and strange readings are taken out of the data or rectified.
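To make that concrete, here is a minimal sketch in pandas of the kinds of fixes Joe describes: coding values uniformly so that rows from different sources line up, and taking out or rectifying strange readings. The column names and value ranges here are hypothetical, purely for illustration.

```python
import pandas as pd

# Hypothetical data pulled from two sources that coded the same field
# differently and left a few implausible readings in place.
df = pd.DataFrame({
    "state": ["CA", "California", "ca", "NY", "new york"],
    "temp_f": [68.0, 71.5, 999.0, 65.2, -120.0],  # 999.0 and -120.0 are bad readings
})

# Code the 'state' field uniformly so rows from both sources line up.
state_map = {"california": "CA", "ca": "CA", "new york": "NY", "ny": "NY"}
df["state"] = df["state"].str.lower().map(state_map).fillna(df["state"])

# Take out strange readings: anything outside a plausible range is
# marked missing rather than silently kept.
df.loc[~df["temp_f"].between(-30, 130), "temp_f"] = float("nan")
```

Even in a toy example like this, deciding whether to drop, cap, or mark bad readings as missing is a judgment call rather than an automatic step, which foreshadows Joe's point about human decisions below.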
In 2012, I was doing research with some colleagues at Stanford, who work in human-computer interaction. And we were studying this problem that people who had data didn’t seem to have the capacity to get it into a shape where they could work with it, to be able to plot it, or to be able to run analytic functions on it. And we were thinking at the time about users, like journalists who were trying to work with data to support stories that were important to people… And in essence, what we discovered was that these [were] non-programmers, but fundamentally cleaning up data is a very time-consuming and very boring programming task. And so, we asked ourselves, can we use a combination of visual interfaces and intelligent AI algorithms in the background to make it possible for people to take their messy data…and get it into the rows and columns you need to build a chart or run a machine learning algorithm? And so that work was called Wrangler. And then we saw great uptake on the open-source [version] of that. So, we went ahead and built a company called Trifacta to commercialize the Wrangler technology. Today Trifacta’s Wrangler product is [used] in corporations all over the world.
We have a strong point of view on this idea that data preparation is a collaboration between human intelligence and AI. That basically humans can’t do all the work themselves. It’s too tedious. There are too many things to keep track of. At the same time, there’s no automated algorithm that takes input data and produces good data. You can’t say, “Siri, clean my data” and expect to get anything good out. And the reason for that is essentially that data is a medium. You use it like you use clay on a potter’s wheel: you shape it for purpose. And so, depending on what you want to do with the data and depending on what’s in the data, you make decisions, design decisions, essentially, about how you’re going to transform that data for use. And so, you need to be in dialogue, essentially, with these algorithms that are analyzing the data to try to figure out what the outcomes should be. And so that’s a sort of lovely dance between analytic algorithms in the background and human insight in the foreground. At the end of the day, you want the people who know the data best making decisions about what the data should look like.
[There is] quite a bit of judgment that goes into data preparation, and it absolutely has an effect on outcomes down the line. I don’t want to overemphasize it in the sense that some algorithms are more robust to dirty data than others. Some tasks you’re trying to do are more robust to dirty data than others. So sometimes you want the quick and dirty, and you just kind of get the data into an okay shape and move on. Sometimes you really need to do creative things to the data to be able to see what you want to see with your algorithms.
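As a quick, hypothetical illustration of that robustness point: a single mis-keyed reading can drag a mean far off while barely moving a median, so an analysis built on the latter can get away with quicker, dirtier prep.

```python
import statistics

# Hypothetical sensor readings with one mis-keyed value (9100.0 instead of 91.0).
clean = [88.0, 91.0, 90.5, 89.2, 90.1]
dirty = clean + [9100.0]

# The mean is badly distorted by the single dirty value...
print(statistics.mean(clean), statistics.mean(dirty))      # 89.76 vs ~1591.47

# ...while the median barely moves.
print(statistics.median(clean), statistics.median(dirty))  # 90.1 vs 90.3
```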
What’s interesting about it is it’s a bit like software engineering, but it’s a bit like media editing, like film editing or audio editing in the sense that you’re looking at this thing. You really want to kind of poke it and shape it in ways that are pretty visual. But at the end of the day, you’re building a program, a script to transform the data using algorithms. And so that back and forth between kind of the visual intuition and the programmatic expression of a recipe for cleaning the data is where some of the magic lies.
The first is simply: is the work that I did yesterday actually reproducible, like deterministically? … It’s very important that whatever tools or processes you’re using document your data transformations in a programmatic way that you can rerun. And so, absolutely, in Trifacta, it’s essentially generating code… we call it a recipe. It’s a program for manipulating the data…. And it’s kind of table stakes, I think that’s basic, if you’re going to do governable, reproducible data analytics or machine learning.
A second question you asked, which is, I think, deeper is: suppose you took two different people, you gave them the same data set and the same task, and you locked them in two separate rooms. Would they prep the data in the same way? I think the answer to that’s almost certainly no.
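To make the “recipe” idea concrete, here is a minimal sketch of data prep expressed as code that can be rerun deterministically. This is not Trifacta’s actual recipe format; the step names and the `order_date` column are hypothetical.

```python
import pandas as pd

# Each step is an ordinary, deterministic function of the input table.
def normalize_headers(df: pd.DataFrame) -> pd.DataFrame:
    return df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna(how="all")

def parse_dates(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(order_date=pd.to_datetime(df["order_date"]))

# The "recipe" is just the ordered list of steps. Because it is code, it
# can live in version control and be rerun on the raw data at any time.
RECIPE = [normalize_headers, drop_empty_rows, parse_dates]

def run_recipe(df: pd.DataFrame) -> pd.DataFrame:
    for step in RECIPE:
        df = step(df)
    return df
```

Rerunning `run_recipe` on the same raw input yields the same prepared table, which is the reproducibility Joe calls table stakes. His second point, that two people given the same data would almost certainly write different recipes, is about the judgment encoded in the steps themselves, not the machinery.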
Get in touch with Determined if you’d like to hear more! We’ll be adding new podcast recordings bi-weekly here on our blog, as well as on the major streaming platforms mentioned above.