November 30, 2020
Kubernetes is one of the most important pieces of software produced in the last decade and one of the most influential open source projects ever. Kubernetes has completely revolutionized how applications are developed and how infrastructure is deployed and managed.
With Kubernetes’ explosive rise, more and more physical hardware is being managed by Kubernetes. This trend has coincided with an explosion in the popularity of deep learning, an extremely computationally demanding technology that can result in a single data scientist occupying dozens of GPUs for weeks at a time. This has led to an obvious development:
This seems great! Give data scientists access to more hardware. The problem is that a bunch of tools out there are actually:
This may sound like the same thing but it’s not — making hardware more accessible is great, but not if you force data scientists to understand Kubernetes first. Kubernetes was built by software engineers, for software engineers.
If you happen to be a Unicorn, congratulations! Use your magic to wield Kubernetes and Deep Learning together and build something beautiful. For everyone else (most of us!), you’ll instead be pretty annoyed that you need to deep dive into computer systems before you can make progress developing new ML models. The crazy thing is, this hassle is completely avoidable! We just need to start developing ML tools built for data scientists, not software engineers.
Let’s take a quick look at Kubeflow to understand what I mean about data science tools that are built for software engineers. Kubeflow started as an adaptation of how Google was running TensorFlow internally, as a tool that allowed TensorFlow to run on Kubernetes. This technology was very impactful, creating a much simpler way to use hardware managed by Kubernetes to do deep learning.
That initial version of Kubeflow is now the Kubeflow component called TFJob. Without TFJob, running TensorFlow on Kubernetes would be miserable — you would need to specify a complex topology of containers, networking, and storage before you could even start writing your ML code. With TFJob, this is simplified, but, crucially, it is not nearly simple enough. To use TFJob, you need to:
Wrap your ML code up neatly in a container. This will be a clunky experience that will require you to package your code and upload it if you want to make changes. Docker is great, but this will slow down your development cycle significantly.
Write a Kubernetes TFJob manifest. This might not sound that intimidating, but for a data scientist not fluent in Kubernetes it can be a daunting task. To do this well, you’ll need to learn a lot about Kubernetes, a far cry from the Python that these scientists are used to. Let’s look at the simplest version of this, from the Kubeflow docs:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob
  namespace: your-user-namespace
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
This configuration file is full of concepts that are foreign to most data scientists. Pods, replicas, sidecars, restart policies, Kubernetes APIs — all of this is confusing, complex, and detracts from our ability to focus on data science.
Learn the Kubernetes CLI. This is minor, but again, navigating Kubernetes is not trivial. Submitting jobs may be relatively straightforward, but seeing results, artifacts, and logs of experiments is unintuitive and clunky.
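To make the boilerplate point concrete, here is a minimal sketch in plain Python (standard library only) that regenerates the manifest above from the few values a data scientist actually cares about: the image, the training module, its arguments, and the number of workers. The helper names `replica_spec` and `build_tfjob_manifest` are hypothetical and not part of Kubeflow; they only illustrate how much of the manifest is infrastructure detail rather than ML.

```python
import json


def replica_spec(image, module, args, replicas):
    """Build one replica spec with the same fields as the manifest above."""
    command = ["python", "-m", module] + [f"--{k}={v}" for k, v in args.items()]
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
            "spec": {
                "containers": [
                    {"name": "tensorflow", "image": image, "command": command}
                ]
            },
        },
    }


def build_tfjob_manifest(image, module, args, workers=3, parameter_servers=1,
                         namespace="your-user-namespace"):
    """Hypothetical helper: ML-level inputs in, Kubernetes-level manifest out."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"generateName": "tfjob", "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica_spec(image, module, args, parameter_servers),
                "Worker": replica_spec(image, module, args, workers),
            }
        },
    }


manifest = build_tfjob_manifest(
    image="gcr.io/your-project/your-image",
    module="trainer.task",
    args={"batch_size": 32, "training_steps": 1000},
)
# Print the manifest as JSON so it can be compared with the YAML above.
print(json.dumps(manifest, indent=2))
```

Four meaningful inputs expand into roughly forty lines of manifest. Everything else is the kind of detail a tool could fill in on the user's behalf.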
Let’s revisit where this all came from: an adaptation of the technology that Google used to train DL models. Google (unlike most of the rest of the world) is filled with unicorns — running Deep Learning on Kubernetes is not a big deal! What is best for Google isn’t always best for the rest of the world though, and in this case I’ve personally seen a lot of very skilled data scientists bounce off of Kubeflow because Kubernetes was too far outside of their comfort zone.
If you look at some of the other components of Kubeflow, you’ll find similar philosophies: MPIJob, PyTorchJob, and Katib all expect data scientists to work with Kubernetes concepts and APIs. All of them suffer from the exact same usability issues. Most data scientists don’t want to dive into the weeds of how Kubernetes is orchestrating the hardware; they just want an easier way to train their models. They want tools that abstract away foreign concepts and let them communicate ML concepts succinctly.
One fascinating thing about Kubeflow is that some of its components have clearly figured this out! The best example is Kubeflow Pipelines (KFP). Its core underlying technology is Argo Workflows, which is very similar to TFJob in that it provides a way to declare workflows in Kubernetes. Kubeflow Pipelines takes the crucial extra step of providing a domain-specific language that allows data scientists to write pipelines in Python! The builders of KFP realized that building containers and writing Kubernetes manifests wasn’t how data scientists wanted to interact with their work, so they abstracted away the k8s and made a tool that data scientists love.
Even parts of Kubeflow have realized that we can do better!
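The idea behind a Python pipeline DSL can be sketched in a few lines. To be clear, this is a toy illustration and not the Kubeflow Pipelines SDK: the `Pipeline` class, its `step` decorator, and the shape of the compiled output are all hypothetical. It only shows the pattern of writing ordinary Python and letting the tool produce the workflow description.

```python
class Pipeline:
    """Toy pipeline builder: records steps and their dependencies."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def step(self, image, after=()):
        """Decorator that registers a Python function as a containerized step."""
        def register(func):
            self.steps.append({
                "name": func.__name__,
                "image": image,
                "dependencies": [dep.__name__ for dep in after],
            })
            return func
        return register

    def compile(self):
        """Emit a plain dict standing in for a generated workflow manifest."""
        return {"name": self.name, "steps": self.steps}


pipeline = Pipeline("train-and-evaluate")


@pipeline.step(image="your-image")
def preprocess():
    ...


@pipeline.step(image="your-image", after=[preprocess])
def train():
    ...


@pipeline.step(image="your-image", after=[train])
def evaluate():
    ...


spec = pipeline.compile()
# Show each step with the steps it depends on.
print([(s["name"], s["dependencies"]) for s in spec["steps"]])
```

The data scientist writes functions and states which ones come first. The manifest is an output of the tool, not something the user has to author by hand.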
Machine learning is exploding in popularity, and as it does, ML tooling is frantically trying to keep up. Tools for everything you can imagine in ML are popping up: data versioning tools, experiment tracking tools, model serving tools, and yes, even tools to help ML run on Kubernetes. However, the space is still young; although some projects are more popular than others, there are no clear winners yet, and nobody has yet established the perfect set of software to enable and accelerate ML.
What will the winners look like? Some thoughts:
These tenets are a long way of saying something we all know at this point: the best software has the best UX. The only additional advice is to make sure you actually understand the possible users of your software! Know what data scientists like and dislike about their current workflows, then enhance what they like and slice out what they don’t!
At Determined, we’ve worked with plenty of real-world deep learning scientists to understand their pain points and how they work today. Determined accomplishes many of the same goals as a tool like Kubeflow — allowing scientists to build and train deep learning models on Kubernetes (or any hardware, really), but without expecting data scientists to master countless new technologies along the way.
Compare a Determined configuration file to the TFJob configuration above:
description: cifar10_pytorch
hyperparameters:
  learning_rate: 1e-4
  learning_rate_decay: 1e-6
  layer1_dropout: 0.25
  layer2_dropout: 0.25
  layer3_dropout: 0.5
  global_batch_size: 32
records_per_epoch: 50000
searcher:
  name: single
  metric: validation_error
  max_length:
    epochs: 32
entrypoint: model_def:CIFARTrial
This configuration accomplishes essentially the same goal — describing an ML training workflow. The big difference is that this configuration is written in the language of data scientists, with complicated infrastructure concepts abstracted away. We speak to users in terms they are comfortable with: hyperparameters, epochs, metrics, etc.
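As a rough sketch of what "abstracted away" means here, the snippet below derives a few lower-level quantities from the ML-level fields in the configuration above. It is not Determined's actual code: the `training_plan` function and the `num_gpus` argument are hypothetical, and the floor-division rounding is an assumption made for illustration.

```python
# The ML-level fields from the configuration above, as a plain dict.
config = {
    "hyperparameters": {"global_batch_size": 32},
    "records_per_epoch": 50000,
    "searcher": {"max_length": {"epochs": 32}},
}


def training_plan(config, num_gpus=1):
    """Derive low-level quantities from the scientist-facing config.

    num_gpus is a hypothetical input; floor division is an assumed
    rounding rule, not necessarily what any real tool does.
    """
    batch_size = config["hyperparameters"]["global_batch_size"]
    epochs = config["searcher"]["max_length"]["epochs"]
    total_records = epochs * config["records_per_epoch"]
    return {
        "per_gpu_batch_size": batch_size // num_gpus,
        "batches_per_epoch": config["records_per_epoch"] // batch_size,
        "total_batches": total_records // batch_size,
    }


plan = training_plan(config, num_gpus=4)
print(plan)
# → {'per_gpu_batch_size': 8, 'batches_per_epoch': 1562, 'total_batches': 50000}
```

The user states intent in epochs and a global batch size. How that maps onto batches and individual GPUs is bookkeeping the tool can own.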
This means that our users can do more with less. Instead of having to learn Kubernetes or configure a cluster of machines to work with Horovod, they simply need to install Determined and describe experiments in their own terms. Determined unlocks incredibly powerful tools like distributed training, hyperparameter search, and experiment tracking, without placing an extra burden on the user to understand what is happening behind the scenes. Determined has carefully built powerful abstractions that allow data scientists to focus on science rather than on engineering, systems, and infrastructure.
Determined understands data scientists and enables them to accomplish more with less work. To see it in action, start with our quick start guide! Determined is open source, so you can see how we do it in our GitHub repository.
If you have any questions along the way, hop on our community Slack; we’re happy to help out!