March 13, 2019
We are entering the golden age of artificial intelligence. Model-driven, statistical AI has already been responsible for breakthroughs in applications such as computer vision, speech recognition, and machine translation, with countless more use cases on the horizon. But if AI is the linchpin of a new era of innovation, why does the infrastructure it’s built upon feel trapped in the 20th century? Worse, why is advanced AI tooling locked within the walls of a handful of multi-billion-dollar tech companies, inaccessible to anyone else?
Today we’re formally introducing Determined AI, a company that exists to let AI engineers everywhere focus on models, not infrastructure. Determined AI is backed by GV (formerly Google Ventures), Amplify Partners, CRV, Haystack, SV Angel, The House, and Specialized Types.
AI, and specifically deep learning (DL), is becoming the most important computational workload for businesses and industries of all kinds. For example, DL has dramatically advanced the performance of autonomous vehicles at Waymo; DL powers Siri, Apple’s personal assistant that communicates via speech synthesis; and it has revolutionized Facebook’s ability to understand user sentiment. These applications, pioneered by a handful of cutting-edge technology firms, speak to the power of DL, but also the need for it to be accessible to a much wider range of businesses and developers.
These firms have a key advantage when it comes to exploiting the power of deep learning: they have built sophisticated AI-native infrastructure for internal use. Everyone else has to make do with existing tools, which are woefully inadequate for AI-driven application development, a paradigm that is radically different from conventional software development. Indeed, the vast majority of engineers today are forced to cobble together tools that speak non-standard protocols and use non-standard file formats, stitched together with ad-hoc, multi-step workflows. These point solutions create enormous complexity, and enormous amounts of time and productivity are lost to the resulting inefficiencies. As a result, organizations that depend on advances in AI – like anyone working with vision, speech, or natural language – risk being held back without a radically new approach to AI infrastructure.
At Determined AI, our goal is to power deep learning at the speed of thought. We build specialized software that directly addresses the challenges DL developers struggle with every day. Here’s how we’re going about it.
Given how important deep learning has become – and how different DL is from traditional computational tasks – it is time to rethink how we’re building AI infrastructure from the ground up.
To achieve this, we started by assembling the right people. We believe that building AI-native infrastructure requires a team with a rare combination of skills: a deep understanding of modern AI workloads, but also expertise in building large-scale data-intensive systems. We’re fortunate that our team includes world-leading experts in both domains. Creating an environment where these two groups of people can collaborate and co-design the system together was a key first step.
Next, we adopted two key design principles: build an integrated platform rather than a patchwork of point solutions, and specialize that platform for the unique challenges of deep learning.
Combining these principles – an integrated platform that is specialized for the unique challenges of deep learning – yields massive improvements in both performance and usability. For example, many companies employ cluster schedulers like Kubernetes, Mesos, or YARN, which can be used to run deep learning workloads. However, traditional cluster schedulers and leading DL frameworks have been designed independently, which hurts both performance and usability. In contrast, we have developed a specialized GPU scheduler that natively understands key deep learning workloads, including distributed training, hyperparameter tuning, and batch inference. This yields dramatically better performance: for example, our software performs hyperparameter tuning more than 50x faster than conventional methods! Moreover, DL workloads on our platform automatically gain seamless fault tolerance and dynamic elasticity, and they can scale from on-premise resources to cloud capacity on demand.
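To make that concrete, here is a minimal, purely illustrative sketch – ordinary Python, not our actual scheduler or API – of one way a DL-aware scheduler can speed up hyperparameter tuning: adaptive early stopping via successive halving, in which many trials start with a small training budget and only the most promising survivors receive more GPU time. The names (train_for, successive_halving) and the fake loss curve are hypothetical stand-ins for real training jobs.

```python
import random

def train_for(config, epochs, state=None):
    """Stand-in for a real training job: returns (validation_loss, state).

    In a real system this would run on a GPU slot managed by the scheduler;
    here we fake a loss curve that improves with more epochs and depends on
    the learning rate (purely illustrative).
    """
    lr = config["lr"]
    noise = random.uniform(0.0, 0.05)
    loss = abs(lr - 0.01) * 10 / (1 + epochs) + noise
    return loss, state

def successive_halving(num_trials=32, min_epochs=1, eta=2):
    """Toy successive-halving search: train all surviving trials for a
    growing epoch budget, then keep only the top 1/eta of them."""
    trials = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(num_trials)]
    budget = min_epochs
    while len(trials) > 1:
        scored = []
        for config in trials:
            loss, _ = train_for(config, budget)   # one "job" per surviving trial
            scored.append((loss, config))
        scored.sort(key=lambda pair: pair[0])      # best (lowest loss) first
        trials = [config for _, config in scored[: max(1, len(scored) // eta)]]
        budget *= eta                              # survivors get a larger budget
    return trials[0]

if __name__ == "__main__":
    best = successive_halving()
    print("best config found:", best)
```

Adaptive allocation of this kind is one way large speedups over grid or random search can be achieved; a production scheduler additionally has to handle fault tolerance, preemption, and elastic allocation of GPUs across many such jobs.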
Although we are announcing the company today, our software has been running on production GPUs for more than a year. Our customers tell us that we have already saved them hundreds of engineering hours per person per year and hundreds of thousands of GPU-hours across their teams. However, there is still much work to be done to reinvent the software stack for the AI-native era ahead, and we’re excited to build that future together with our customers.
There are many reasons to be optimistic about the enormous potential of AI, but to realize that potential, AI development must be broadly accessible in the same way that software development is accessible today. Anyone should be able to apply AI to problems that they’re working to solve, and we’re excited to be a part of that journey.