July 25, 2018
To maximize the value of your deep learning hardware, you’ll need to invest in software infrastructure. Setting up a cluster manager is an essential first step in this process, but it’s not the end of the story.
Imagine you are the manager of your company’s core ML team. Given the excitement around deep learning, you’ve encouraged one of your engineers to start experimenting with it. After a few weeks of exploration, initial results suggest that a deep network leads to significant improvement over what your traditional ML methods have been able to achieve. Awesome!
Given these promising findings, you move quickly to scale up your investment: you hire more DL engineers and buy your own GPU cluster. (Why go on-prem? Read our post on why we see this as an emerging trend in deep learning.) Between getting your new recruits ramped up and setting up your newly purchased hardware, you’ve got a lot to do!
In the midst of this transition, it can be hard to think about scaling your ML software infrastructure. Furthermore, you might not see the need—Keras and PyTorch worked just fine for your initial experiments. What you don’t realize, however, is that scaling your DL cluster will introduce entirely new kinds of challenges. Here’s one:
How do you share your new GPU hardware among a growing team of DL engineers?
Surprisingly, even sophisticated teams we talk to often adopt quite low-tech solutions to this challenge, such as a shared spreadsheet where engineers reserve GPUs, a message in the team chat to call dibs on a machine, or an informal understanding of who “owns” which box.
Unfortunately, while these manual systems are easy to set up, they are also error-prone and wasteful: engineers trip over each other’s jobs, expensive GPUs sit idle while people wait their turn, and the coordination overhead only grows with the size of the team.
Fortunately, the challenge of running heterogeneous workloads on a cluster of shared compute resources is not a new one. Cluster management software such as Kubernetes, DC/OS (built on top of Apache Mesos), and Slurm allows you to treat a collection of machines as a unified pool of hardware resources. This makes running workloads on your cluster dramatically easier: users only need to think about launching new containers, not about managing individual machines. Cluster management software often provides ancillary services that simplify distributed computing, such as fault tolerance, networking, and security. Historically, these frameworks focused on managing CPU, memory, and disk resources, but recently all three have been updated to include basic support for GPU resources.
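To make the “unified pool” idea concrete, here is a minimal sketch, using the official Kubernetes Python client, that tallies the GPUs the cluster advertises as schedulable. It assumes a local kubeconfig pointing at your cluster and that nodes expose GPUs under the `nvidia.com/gpu` resource name via the NVIDIA device plugin; neither detail comes from this post.

```python
from kubernetes import client, config

# Assumes ~/.kube/config points at your cluster.
config.load_kube_config()

# Sum the allocatable GPUs advertised by every node.
# "nvidia.com/gpu" is the resource name exposed by the NVIDIA device plugin.
total_gpus = 0
for node in client.CoreV1Api().list_node().items:
    gpus = int(node.status.allocatable.get("nvidia.com/gpu", "0"))
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
    total_gpus += gpus

print(f"Cluster-wide GPU pool: {total_gpus}")
```

From the scheduler’s point of view, those GPUs are just another countable resource, which is exactly what lets users stop thinking about individual machines.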
Spinning up a new cluster manager can be a little involved, especially if no one on your team has a systems background. However, compared to the ongoing headaches of manual GPU management, you expect that adopting one of these systems will pay off very quickly, and so you decide to move forward with Kubernetes.
With your functioning Kubernetes cluster, you expect your resource management problems to be fully solved. The early signs are promising. Your engineers no longer have to SSH into a particular machine to start a job; they can instead submit a containerized deep learning training job to the cluster, specifying the number of GPUs they’ll need.
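For illustration, here is roughly what that submission might look like with the Kubernetes Python client. The image name, job name, namespace, and GPU count are placeholders, and the `nvidia.com/gpu` limit again assumes the NVIDIA device plugin is installed; treat this as a sketch rather than a prescribed workflow.

```python
from kubernetes import client, config

config.load_kube_config()

# A containerized training job that asks the scheduler for two GPUs.
container = client.V1Container(
    name="trainer",
    image="registry.example.com/dl/train:latest",  # placeholder image
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "2"}),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="example-training-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=0,
    ),
)

# Kubernetes picks a node with free GPUs; the engineer never SSHes anywhere.
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```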
You sit back and wait for team productivity to skyrocket. But in subsequent weeks, you see only a modest uptick in the number of models trained. Why?
Unfortunately, the same generic design that makes cluster management software so powerful also leaves it blind to the unique properties of deep learning workloads. As a result, it lacks native support for many of your team’s crucial needs. These include distributed training, experiment tracking, metadata management, and integrated hyperparameter tuning.
In our next post, we’ll describe these needs in more detail to better make the case for why a traditional cluster manager is an insufficient DL infrastructure solution. Then, we’ll outline how Determined provides exactly the missing pieces needed to take your team to the next level.