Data Scientists Don't Care About Kubernetes

By Neil Conway, David Hershey

November 30, 2020

Kubernetes is one of the most important pieces of software produced in the last decade and one of the most influential open source projects ever. Kubernetes has completely revolutionized how applications are developed and how infrastructure is deployed and managed.

With Kubernetes’ explosive rise, more and more physical hardware is being managed by Kubernetes. This trend has coincided with an explosion in the popularity of deep learning, an extremely compute-intensive technology that can result in a single data scientist occupying dozens of GPUs for weeks at a time. This has led to an obvious development:

Tools that allow data scientists to train models using hardware that is managed by Kubernetes.

This seems great! Give data scientists access to more hardware. The problem is that a bunch of tools out there are actually:

Tools that require data scientists to use Kubernetes to train models.

This may sound like the same thing but it’s not — making hardware more accessible is great, but not if you force data scientists to understand Kubernetes first. Kubernetes was built by software engineers, for software engineers.

Most data scientists are not software engineers, and most software engineers are not data scientists.

If you happen to be a unicorn, congratulations! Use your magic to wield Kubernetes and deep learning together and build something beautiful. Everyone else (most of us!) will instead be pretty annoyed at needing to take a deep dive into computer systems before making any progress on new ML models. The crazy thing is, this hassle is completely avoidable! We just need to start developing ML tools built for data scientists, not software engineers.

ML Tooling Gone Wrong

Let’s take a quick look at Kubeflow to understand what I mean about data science tools that are built for software engineers. Kubeflow started as an adaptation of how Google was running TensorFlow internally, as a tool that allowed TensorFlow to run on Kubernetes. This technology was very impactful, creating a much simpler way to use hardware managed by Kubernetes to do deep learning.

That initial version of Kubeflow is now the Kubeflow component called TFJob. Without TFJob, running TensorFlow on Kubernetes would be miserable — you would need to specify a complex topology of containers, networking, and storage before you could even start writing your ML code. With TFJob, this is simplified, but, crucially, it is not nearly simple enough. To use TFJob, you need to:

Wrap your ML code up neatly in a container. This is a clunky experience: every time you change your code, you have to repackage it and upload it again. Docker is great, but this will slow down your development cycle significantly.

Write a Kubernetes TFJob manifest. This might not sound that intimidating, but for a data scientist not fluent in Kubernetes it can be a daunting task. To do this well, you’ll need to learn a lot about Kubernetes, which is a far cry from the Python that these scientists are used to. Let’s look at the simplest version of this, from the Kubeflow docs:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob
  namespace: your-user-namespace
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/your-project/your-image
            command:
              - python
              - -m
              - trainer.task
              - --batch_size=32
              - --training_steps=1000
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/your-project/your-image
            command:
              - python
              - -m
              - trainer.task
              - --batch_size=32
              - --training_steps=1000

This configuration file is full of concepts that are foreign to most data scientists. Pods, replicas, sidecars, restart policies, Kubernetes APIs: all of this is confusing and complex, and it detracts from our ability to focus on data science.

Learn the Kubernetes CLI. This is minor, but again, navigating Kubernetes is not trivial. Submitting jobs may be relatively straightforward, but viewing the results, artifacts, and logs of experiments is unintuitive and clunky.

Let’s revisit where this all came from: an adaptation of the technology that Google used to train DL models. Google (unlike most of the rest of the world) is filled with unicorns, so running deep learning on Kubernetes is not a big deal there. What is best for Google isn’t always best for the rest of the world, though, and in this case I’ve personally seen a lot of very skilled data scientists bounce off Kubeflow because Kubernetes was too far outside their comfort zone.

If you look at some of the other components of Kubeflow, you’ll find similar philosophies: MPIJob, PyTorchJob, and Katib all expect data scientists to work with Kubernetes concepts and APIs. All of them suffer from the exact same usability issue: most data scientists don’t want to dive into the weeds of how Kubernetes is orchestrating the hardware; they just want an easier way to train their models. They want tools that abstract away foreign concepts and let them communicate ML concepts succinctly.
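As a rough illustration of what such an abstraction could look like, here is a minimal Python sketch that generates the TFJob manifest shown earlier from a handful of ML-level arguments. The function name `build_tfjob_manifest` and its parameters are hypothetical, made up for this example; they are not part of Kubeflow or any other library. The point is only that the Kubernetes details can live in one place instead of in every data scientist's head.

```python
import json


def build_tfjob_manifest(image, module, hyperparameters, workers=3,
                         parameter_servers=1, namespace="your-user-namespace"):
    """Build a TFJob manifest (as a plain dict) from ML-level arguments.

    Hypothetical helper for illustration; it mirrors the fields of the
    manifest shown earlier and is not an official Kubeflow API.
    """
    # Turn {"batch_size": 32} into ["--batch_size=32"].
    flags = [f"--{name}={value}" for name, value in hyperparameters.items()]
    command = ["python", "-m", module] + flags

    def replica_spec(replicas):
        # The Kubernetes-specific boilerplate lives here, in one place.
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "metadata": {
                    "annotations": {"sidecar.istio.io/inject": "false"}
                },
                "spec": {
                    "containers": [
                        {"name": "tensorflow", "image": image,
                         "command": list(command)}
                    ]
                },
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"generateName": "tfjob", "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica_spec(parameter_servers),
                "Worker": replica_spec(workers),
            }
        },
    }


manifest = build_tfjob_manifest(
    image="gcr.io/your-project/your-image",
    module="trainer.task",
    hyperparameters={"batch_size": 32, "training_steps": 1000},
)
# Print the manifest as JSON, which kubectl accepts as well as YAML.
print(json.dumps(manifest, indent=2))
```

With a helper like this, the data scientist only names the image, the training module, and the hyperparameters; replicas, restart policies, and annotations get sensible defaults that an infrastructure team could maintain.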

One fascinating thing about Kubeflow is that some of its components have clearly figured this out! The best example is Kubeflow Pipelines (KFP). Its core underlying technology is Argo Workflows, which is very similar to TFJob in that it provides a way to declare workflows in Kubernetes. Kubeflow Pipelines takes the crucial extra step of providing a domain-specific language that allows data scientists to write pipelines in Python! The builders of KFP realized that building containers and writing Kubernetes manifests wasn’t how data scientists wanted to interact with their work, so they abstracted away Kubernetes and made a tool that data scientists love.
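To make the idea concrete, here is a toy sketch of a Python-first pipeline DSL. It is not the Kubeflow Pipelines API: the `Pipeline` class, the `step` decorator, and the shape of the compiled spec are all invented for this illustration. It only shows the pattern of letting users write ordinary Python functions while the tool produces the declarative workflow description behind the scenes.

```python
class Pipeline:
    """Toy pipeline builder (hypothetical; not the KFP API)."""

    def __init__(self, name):
        self.name = name
        self.steps = []  # registration order doubles as a valid run order

    def step(self, after=()):
        """Register a function as a pipeline step that runs after `after`."""
        def decorator(func):
            known = {s["name"] for s in self.steps}
            missing = [dep for dep in after if dep not in known]
            if missing:
                raise ValueError(f"unknown dependencies: {missing}")
            self.steps.append(
                {"name": func.__name__, "dependencies": list(after), "func": func}
            )
            return func
        return decorator

    def compile(self):
        """Emit a declarative, workflow-style spec as a plain dict."""
        return {
            "name": self.name,
            "tasks": [
                {"name": s["name"], "dependencies": s["dependencies"]}
                for s in self.steps
            ],
        }

    def run_locally(self):
        """Run every step in order in this process; handy for debugging."""
        results = {}
        for s in self.steps:
            inputs = [results[dep] for dep in s["dependencies"]]
            results[s["name"]] = s["func"](*inputs)
        return results


pipeline = Pipeline("train-and-evaluate")


@pipeline.step()
def load_data():
    return [1.0, 2.0, 3.0, 4.0]


@pipeline.step(after=["load_data"])
def train(data):
    # Stand-in for real training: the "model" is just the mean of the data.
    return sum(data) / len(data)


@pipeline.step(after=["train"])
def evaluate(model):
    return {"model": model, "ok": model > 0}


spec = pipeline.compile()
results = pipeline.run_locally()
```

A real system would package each task into a container and hand the compiled spec to a workflow engine; in this sketch the spec is just a dictionary, which is enough to show that the user never has to write it by hand.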

Even parts of Kubeflow have realized that we can do better!

Tenets of Good Data Science Tools

Machine learning is exploding in popularity, and as it does, ML tooling is frantically trying to keep up. Tools for everything you can imagine in ML are popping up: data versioning tools, experiment tracking tools, model serving tools, and yes, even tools to help ML run on Kubernetes. However, the space is still young; although some projects are more popular than others, there are no clear winners yet, and nobody has yet established the perfect set of software to enable and accelerate ML.

What will the winners look like? Some thoughts:

ML tools should allow data scientists to accomplish more with less work. The number one opportunity for this is to separate the engineering work that data scientists are currently overwhelmed with from the science that they love.
We need abstractions that make it possible to perform high-performance data science on complex, modern infrastructure without needing to be a systems expert. Designing the right abstractions is not easy but it is crucially important to making modern ML more accessible, more convenient, and more cost-effective.
Most data scientists aren’t unicorns. As such, avoid thinking about data scientists as software engineers, and build tools that allow ML people to build, train, and deploy ML models without needing to become DevOps experts too. There are great tools out there that enable production-grade ML without pushing that burden onto data scientists.

These tenets are a long way of saying something we all know at this point: the best software has the best UX. The only additional advice is to make sure you actually understand the likely users of your software! Know what data scientists like and dislike about their current workflows, then enhance what they like and cut out what they don’t.

An Obvious Shill for Determined

At Determined, we’ve worked with plenty of real-world deep learning scientists to understand their pain points and how they work today. Determined accomplishes many of the same goals as a tool like Kubeflow: it allows scientists to build and train deep learning models on Kubernetes (or any hardware, really), but without expecting them to master countless new technologies along the way.

Compare a Determined configuration file to the TFJob configuration above:

description: cifar10_pytorch
hyperparameters:
  learning_rate: 1.0e-4
  learning_rate_decay: 1.0e-6
  layer1_dropout: 0.25
  layer2_dropout: 0.25
  layer3_dropout: 0.5
  global_batch_size: 32
records_per_epoch: 50000
searcher:
  name: single
  metric: validation_error
  max_length:
    epochs: 32
entrypoint: model_def:CIFARTrial

This configuration accomplishes essentially the same goal — describing an ML training workflow. The big difference is that this configuration is written in the language of data scientists, with complicated infrastructure concepts abstracted away. We speak to users in terms they are comfortable with: hyperparameters, epochs, metrics, etc.

This means that our users can do more with less. Instead of having to learn Kubernetes or configure a cluster of machines to work with Horovod, they simply need to install Determined and describe experiments in their own terms. Determined unlocks incredibly powerful tools like distributed training, hyperparameter search, and experiment tracking, without placing an extra burden on the user to understand what is happening behind the scenes. Determined has carefully built powerful abstractions that allow data scientists to focus on science rather than on engineering, systems, and infrastructure.
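To see how far those ML-level terms go, here is a small back-of-the-envelope sketch in Python. It mirrors a few fields of the experiment configuration above in a plain dictionary and works out how much training the experiment asks for. It assumes that an epoch is exactly `records_per_epoch` records and that every batch holds `global_batch_size` records; the helper name `training_budget` is hypothetical and not part of Determined.

```python
# A few fields from the configuration above, mirrored as a plain dict.
config = {
    "hyperparameters": {"global_batch_size": 32},
    "records_per_epoch": 50000,
    "searcher": {"max_length": {"epochs": 32}},
}


def training_budget(config):
    """Return (total_records, total_batches) implied by the config.

    Hypothetical helper: assumes an epoch is `records_per_epoch` records
    and rounds down to whole batches.
    """
    epochs = config["searcher"]["max_length"]["epochs"]
    total_records = epochs * config["records_per_epoch"]
    batch_size = config["hyperparameters"]["global_batch_size"]
    return total_records, total_records // batch_size


total_records, total_batches = training_budget(config)
print(total_records, total_batches)  # 1600000 50000
```

Everything needed for this calculation is stated in units a data scientist already thinks in: records, epochs, and batch size. Nothing in it refers to pods, replicas, or containers.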

Determined understands data scientists and enables them to accomplish more with less work. To see it in action, start with our quick start guide! Determined is open source, so you can see how we do it in our GitHub repository.

If you have any questions along the way, hop on our community Slack; we’re happy to help out!
