November 12, 2020
Over the last several weeks, we’ve spoken to experts working on various aspects of the machine learning pipeline, from data wrangling to programmatic data labeling to specialized AI hardware. In today’s episode with Determined CTO and co-founder Neil Conway, we discuss one of the central components of that pipeline: model training and development. In particular, we consider the novel challenges that arise as people get serious about deep learning, the crucial role of startups in tech, and the excitement surrounding open source deep learning projects such as Kubeflow, Kubernetes, and Horovod, and how Determined fits into the conversation. Read on for more highlights, or listen to the full episode via your preferred streaming platform below.
Read the full transcript here.
From a tooling and an infrastructure perspective, it is really challenging to build deep learning applications, especially if you happen to be outside of a very small number of elite organizations like Google Brain or Facebook AI Research.
Deep learning is fundamentally different from classical machine learning…you are typically training models on custom accelerators, typically GPUs, and part of the reason you need GPUs is that the computational requirements of training these deep learning models are dramatically larger than for classical machine learning. So, even with a modern GPU, training a deep learning model might take a week of computation or longer. The amount of computation, as well as the size of the data sets that people are using for deep learning, is really a stark difference.
Deep learning is a reason why server hardware has suddenly become a lot more interesting than it has been in the past. It’s been quite a challenge to figure out how to integrate some of these custom accelerators into enterprise computing platforms and cluster management systems and make it simple. It’s still a challenge for a lot of teams to manage GPUs in the same way that they’re comfortable managing CPUs. But we’re headed toward an environment where computation and hardware are likely to be a lot more heterogeneous than they have been in the past. Right now NVIDIA GPUs are obviously very widely used for deep learning, but our suspicion is that the hardware is likely to get even more diverse. There are a lot of companies building custom chips for aspects of deep learning, either training or inference or both.
There is a lot of excitement around tools for deep learning and different software packages to train deep learning models better. A lot of that is centered around PyTorch and TensorFlow, which are really great packages. But I think of those as single-user tools. So if you’re a single researcher with a single GPU, tools like PyTorch and TensorFlow help you with the basics, right? How do I write down my model architecture? How do I describe how I’m optimizing the model? How do I describe how I’m computing validation metrics? How do I load data into the model? Those kinds of basic questions.
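To make those basics concrete, here is a minimal, illustrative PyTorch sketch covering the pieces Neil lists: a model architecture, an optimizer, data loading, and a validation metric. The model, data, and hyperparameter choices are toy placeholders, not anything specific to Determined or the episode.

```python
# A minimal sketch of "the basics": model architecture, optimizer, data
# loading, and a validation metric. All names and values are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Model architecture: a small feed-forward classifier.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

# How the model is optimized.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Data loading (random tensors stand in for a real dataset).
train_data = TensorDataset(torch.randn(1024, 784), torch.randint(0, 10, (1024,)))
val_data = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
val_loader = DataLoader(val_data, batch_size=64)

for epoch in range(2):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # Validation metric: accuracy on the held-out split.
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
    print(f"epoch {epoch}: val_acc={correct / len(val_data):.3f}")
```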
But there’s a whole bunch of additional challenges that are essential to being successful with deep learning. Everything from: How do I train my model using a larger amount of computation? How do I track all the metrics that are produced by that training process? How do I tune my hyperparameters? How do I share my computational resources with other people on my team? How does my model get from the training environment to production? Once you go outside the scope of tools like TensorFlow and PyTorch, today’s tooling often falls short for deep learning engineers.
There’s actually quite a lot of overlap between what you need as a researcher today to really be successful doing deep learning and what organizations that are deploying deep learning require. As a researcher, you’re not just training a single model one time with a single GPU. Even if you are operating at a smaller scale, being able to move from training your model on one GPU to accessing massive computation when that’s appropriate, and being able to do that in a really flexible way, there’s still a lot of value there. Similarly, being able to keep track of all the metrics of the jobs that you’re training, and to organize that information in a structured way, is still really valuable, whether you’re at a small scale or a large scale. And in our experience talking to a lot of machine learning practitioners, if you’re an individual researcher, you don’t necessarily have the benefit of a platform team or an infrastructure team at your company. And it can be very easy to spend the time that you had budgeted for deep learning essentially managing infrastructure, dealing with the kinds of challenges that come with running compute jobs on cloud resources or an on-premise GPU cluster.
Determined is open source, and we also build on top of a bunch of open source technologies; tools like TensorFlow, PyTorch, Docker, Kubernetes, and a bunch of others are part of the platform that we’ve built. What differentiates us is that we’ve built a batteries-included deep learning development environment, which makes all the tasks you need to accomplish while training and developing deep learning models as simple as we can and automates the stuff you shouldn’t have to think about. Where it makes sense to reuse an open source tool, we’re happy to do that. Where we feel there’s a big advantage to building something from scratch internally, we’ve done that as well.
Our platform enables deep learning engineers to use distributed training, which is to train one deep learning model using many GPUs. The underlying distributed training engine we use to do that is an open source project called Horovod, originally from Uber, which is a great piece of software. But part of the value that we’re adding is making it much easier to configure Horovod and to use it in a multi-user setting. So with Determined, all you need to do is specify the number of GPUs that you want to use for training, and the system takes care of scheduling that job on those GPUs, launching the containers, setting up communication between those containers, configuring Horovod appropriately, and running that distributed training job in a fault-tolerant way.
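For a sense of what that looks like in practice, here is a rough sketch of the kind of experiment configuration involved. The field names, values, and entrypoint below are illustrative and may differ between versions (see the docs for the authoritative schema); the point is that the engineer asks for a number of GPU “slots” and leaves the Horovod, container, and networking setup to the platform.

```python
# Illustrative sketch: build a Determined-style experiment config and write it
# to YAML. Field names and values are examples, not a verbatim recipe.
import yaml

experiment_config = {
    "name": "resnet_distributed_example",   # hypothetical experiment name
    "entrypoint": "model_def:MyTrial",      # hypothetical trial class
    "resources": {
        "slots_per_trial": 8,               # ask for 8 GPUs; the platform handles the rest
    },
    "searcher": {
        "name": "single",                   # one training run, no hyperparameter search
        "metric": "validation_loss",
        "max_length": {"batches": 10000},
    },
}

with open("distributed.yaml", "w") as f:
    yaml.safe_dump(experiment_config, f)

# The experiment would then be submitted with the Determined CLI, e.g.:
#   det experiment create distributed.yaml .
```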
If you compare that to the experience you get using Horovod itself or using other kinds of open source tools, I think there’s a consistent pattern, which is that the key technical capability might be present in what’s available in open source. Tools like Horovod are really great at doing distributed training. But if, as a deep learning engineer, I need to integrate five, six, seven of those tools in order to get my job done, then just keeping up with the pace of change of those tools, given how fast the ecosystem is moving, just keeping those tools working well together, is practically a full-time job. And it’s very easy to spend a lot of time configuring the networking I need for Horovod, and the distributed file system I’m storing my data in, and the job scheduler that I’m running my tasks with, and all of a sudden you’re spending a lot of time on these infrastructure and systems questions rather than actually developing deep learning models.
Another built-in capability that our users find really helpful is that we built hyperparameter search directly into the product, as a first-class capability. I think it’s interesting that there’s been a lot of research into more advanced hyperparameter search algorithms…But despite a lot of very clever algorithms for efficiently searching hyperparameter spaces, if you look at surveys that have been done of machine learning practitioners, most of them still use very simple methods for hyperparameter tuning, like grid search or random search. And I think part of the reason why is that a lot of the time these fancier algorithms are implemented as a separate tool, a tool that in many cases doesn’t run on a cluster or makes it hard to access massive computation, often has an API that is difficult to use, and makes running your hyperparameter search pretty complicated. And just that hassle of setting up a new tool, figuring out how to run it on a cluster, figuring out how it integrates with the rest of your experiment tracking or your model development workflow, is a pain that often means these fancier algorithms see pretty limited adoption…[In contrast] In Determined…anytime you want to explore a space of possible alternatives, possible model architectures or data augmentation techniques or dropout percentages or what have you, anytime you have a decision like that, you can very easily kick off one of these search operations to explore that space of alternatives in a systematic and efficient way.
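As an illustration of what kicking off a search might look like, here is a hedged sketch of a configuration that explores a few hyperparameters with an adaptive searcher. The searcher name, search-space syntax, and field names are examples and may differ between versions; see the docs for the exact schema and the list of built-in searchers.

```python
# Illustrative sketch: a hyperparameter-search config using an adaptive
# searcher instead of grid or random search. Names and values are examples.
import yaml

search_config = {
    "name": "hp_search_example",            # hypothetical experiment name
    "entrypoint": "model_def:MyTrial",      # hypothetical trial class
    "hyperparameters": {
        "learning_rate": {"type": "log", "base": 10, "minval": -4, "maxval": -1},
        "dropout": {"type": "double", "minval": 0.1, "maxval": 0.5},
        "hidden_size": {"type": "int", "minval": 64, "maxval": 512},
    },
    "searcher": {
        "name": "adaptive_asha",            # adaptive early-stopping search
        "metric": "validation_loss",
        "smaller_is_better": True,
        "max_trials": 64,                   # how many configurations to explore
        "max_length": {"batches": 5000},    # budget for the longest trials
    },
    "resources": {"slots_per_trial": 1},
}

with open("search.yaml", "w") as f:
    yaml.safe_dump(search_config, f)
# Submitted the same way as before, e.g.: det experiment create search.yaml .
```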
All of the deep learning workloads that you launch through Determined can run as Kubernetes pods, and for a lot of infrastructure and platform teams, that’s a really comfortable way to run their infrastructure. That being said, using Determined, a lot of that Kubernetes complexity is abstracted away from you. So, as a deep learning engineer, using Determined on top of Kubernetes is no different from using Determined on top of a cloud environment or a bare-metal on-premise environment. You don’t need to think about configuring Kubernetes or the details of the Kubernetes resources that your jobs are running on. We give you a very simple, deep learning-focused configuration file to write, and we really allow you to focus on deep learning rather than on the details of the Kubernetes environment that your workloads happen to run in.
Our engineers have been writing about some really compelling topics over the last few weeks. Read up on:
For more information on getting started with Determined, visit our docs page. We also encourage you to join our community Slack and check out our repository on GitHub. Our next podcast recording will be released on December 2nd, so stay tuned!