Alex Ratner on Programmatic Data Labeling for Machine Learning

In our last episode, we talked about the importance of data preparation in machine learning. Our conversation this week follows a similar path with Alex Ratner, co-founder and CEO of the data labeling company Snorkel AI and an Assistant Professor at the University of Washington.

We talked with Alex about Snorkel’s programmatic data labeling approach, the evolution of bottlenecks in machine learning, and our common goal of helping folks develop AI applications faster. We’ve included a few highlights from the conversation below. For more, listen to the full episode by clicking the link below or via your preferred streaming platform!

Listen on Overcast

Listen on Spotify

Listen on Apple Podcasts

Read the full transcript here.


On the importance of data labeling for deep learning

Snorkel is motivated by modern machine learning models that have a lot more power and push-button capacity…deep neural networks are a great example of this. They automate, or obviate really, a lot of the tasks that used to be so tricky, like picking out the right features of a text document or an image that the model needs to look at… they have hundreds of millions of parameters, hundreds of millions of knobs to tune, to learn the right configuration for. And they need commensurately more labeled training data to power this.

On the beginning of Snorkel

The key idea behind Snorkel is just to, first of all, make this data labeling and creation process the first-class citizen of the system and of the ML development process. And second, to let these users, especially subject matter expert users, programmatically label the data using what they know. Using things like keywords or heuristics or patterns or other sources of signal, rather than just through labeling data points by hand.

On the importance of programmatically labeling data

The benefit of this - of essentially labeling data with programs rather than by hand - is [that] you can do this a lot faster. You can write a couple dozen, in Snorkel [we call them] labeling functions, rather than labeling hundreds of thousands of data points by hand. You can do this in a couple of hours, rather than a couple person-months. You can very easily tweak or modify the way you’re labeling the data, which you just can’t do with hand-labeled data… it’s more privacy compliant, because you can programmatically label your data rather than having to ship it to humans somewhere often out of your org.
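To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It is not Snorkel's API: the spam example, the function names, the label constants, and the majority-vote combiner are all assumptions made for illustration. Snorkel itself combines labeling function outputs with a learned model rather than a simple vote.

```python
from collections import Counter

# Illustrative label values; ABSTAIN means "this rule has no opinion".
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_has_link(text):
    """Heuristic: messages containing a URL are often spam."""
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_prize_keywords(text):
    """Keyword rule: prize-style vocabulary suggests spam."""
    keywords = ("free", "winner", "prize")
    return SPAM if any(k in text.lower() for k in keywords) else ABSTAIN

def lf_short_reply(text):
    """Pattern: very short messages tend to be ordinary replies."""
    return NOT_SPAM if len(text.split()) <= 4 else ABSTAIN

# Hypothetical set of labeling functions written by a subject matter expert.
LABELING_FUNCTIONS = [lf_has_link, lf_prize_keywords, lf_short_reply]

def majority_label(text, lfs=LABELING_FUNCTIONS):
    """Combine votes by simple majority; abstain on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

unlabeled = [
    "You are a winner! Claim your free prize at http://example.com",
    "See you at noon",
    "Can you review the quarterly report before our meeting on Friday?",
]
labels = [majority_label(t) for t in unlabeled]
print(labels)  # [1, 0, -1]
```

Changing how the data is labeled then means editing one of these small functions and rerunning them over the unlabeled set, rather than relabeling by hand.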

On achieving scale, and the time and cost downsides of hand labeling

The most common setting we see is…imagine labeling the data by hand, and you could imagine doing it and maybe spending a month or two, and you could stomach that cost. But it’s the fact that in a month, or a week, or even a couple of days, you’re going to have to go back and relabel it again that is actually the most painful part. And [you have] the ability to, instead, go back and just change some code or tweak some templates with a GUI. And again, I think [one] thing that the field of ML systems and just practical machine learning in general is realizing is that getting a model to a certain accuracy, like we’ve optimized for on the academic side, is one thing, but maintaining it over time and adjusting it, and serving it, and retraining it, that’s the bigger cost over time.

On the complementary nature of Snorkel and Determined

The places where we blow traditional supervision out of the water [are] where maybe you have 50,000 data points labeled by a radiologist, but you have 500,000 unlabeled ones sitting on the hospital server. You can now label them with Snorkel Flow for no extra effort other than some compute cycles… That also then makes the training burden that much harder. We’ve learned it’s very important to support the ability to train a model, see where it makes mistakes and then go back and iterate on your labeling functions as a complete loop… When you have a problem with a lot of scale that’s a serious production problem, after you’ve done some of those initial iterations, we want to be able to couple them with the best-in-class approaches for those next stages of the pipeline, like Determined, to flow into.

And so, I think we learned you need both pieces for a Snorkel-like approach. You need to have some way to quickly iterate with just a simple model and then you need to be able to use best-in-class tools and platforms to get to those second two stages.

On the shift in machine learning bottlenecks from feature engineering to managing training datasets

Imagine you have a logistic regression model that’s trained over 100 hand-picked features. You basically have 100 weights to learn, 100 knobs to learn [to get] the best configuration from some data. And you could imagine even doing that as a human you just look at some data that’s labeled, tune and tweak the knobs. Now imagine you have 100,000,000 knobs to tune. You’d need to look at a lot more data to find anything near the right configuration.

And what does that map to in real-world experience, from the machine learning systems, pragmatic perspective? If you go to a random organization that has a data science team or ML engineering team and say, ‘What’s blocking you today?’… Five, ten years ago, [the] most common answer would be, ‘We’re trying to pick the right features and we’re building the feature extractors’… Today, it’s much more like, ‘We’re waiting for our sister team to label a bunch more training data, or to fix a bunch of training data.’


Are you enjoying the podcast series? Are you a machine learning developer and want to get in touch with us? Drop us a line at ai-open-source@hpe.com, join our community Slack, and stay tuned for another podcast in two weeks’ time.