The cloud giants have an AI problem

The general perception of cloud computing is that it makes all compute tasks cheaper and easier to manage. No one really questions that premise, since Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure have driven the cost of most computing down to cents per hour. The cloud provides cost-effective back-end infrastructure whether you’re an individual working on a small app or a large company streaming video content to millions of people. Now that companies are bringing their biggest data sets and massive amounts of compute power to bear on Artificial Intelligence (AI) applications, the cloud giants must be cashing in, right? Wrong. AI bucks the trend of cloud computing economics, and to make matters worse, the cloud giants are hoarding the technology needed to make AI modeling faster and cheaper.

In order to train modern AI models efficiently, you need specialized hardware, generally in the form of graphics processing units (GPUs). If a company is just getting started on an AI project, a cloud provider can make sense for getting its first model running on a GPU, perhaps as a small test to see whether the model produces an interesting result. But as an organization’s needs grow from one GPU to hundreds or thousands, leasing these resources from the cloud providers becomes nearly 10x as expensive as simply buying and operating the GPUs on-premise. Today, an on-demand instance equipped with a single NVIDIA V100 GPU costs approximately $27,000 per year on each of the leading clouds. On top of that, each provider designs its instances to work best with its own AI software for managing the GPUs, sold at additional cost (as much as $10,700 per GPU per year for AWS SageMaker, for example), locking customers into its AI software platform.
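
To see where those annual figures come from, here is a quick back-of-the-envelope sketch. The hourly rates are my assumptions based on approximate 2019 list pricing for a single-V100 instance (EC2’s p3.2xlarge and its SageMaker counterpart), not official quotes; plug in current rates for your provider.

```python
# Annualizing hourly GPU instance prices. Rates below are assumed
# approximate 2019 list prices, not official quotes.
HOURS_PER_YEAR = 24 * 365  # 8,760

ec2_v100_hourly = 3.06         # $/hr, assumed on-demand rate (p3.2xlarge)
sagemaker_v100_hourly = 4.284  # $/hr, assumed SageMaker rate, same GPU

ec2_annual = ec2_v100_hourly * HOURS_PER_YEAR
sagemaker_premium = (sagemaker_v100_hourly - ec2_v100_hourly) * HOURS_PER_YEAR

print(f"On-demand V100, running year-round: ${ec2_annual:,.0f}")        # ~$26,800
print(f"SageMaker premium on top of that:   ${sagemaker_premium:,.0f}")  # ~$10,700
```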

The table below illustrates the disparity between on-premise and cloud pricing, expressed as effective cost per GPU-hour. Determined AI’s own detailed research found that the top cloud services for AI are up to 9.4x as expensive as owning and operating your own GPU infrastructure, depending on performance and utilization. For the cloud options, we looked at AWS and GCP as representative traditional cloud providers. We surveyed two hardware performance levels, low and high, based on GPU model, and two utilization assumptions: low (10%) and high (100%). Importantly, for the on-premise options we included the costs of power, cooling, human maintenance, and other overhead, so the figures measure total cost of ownership of AI hardware and the comparison is as apples-to-apples as possible. We also assume that when high utilization is predictable, an organization would choose a “reserved” instance type from the cloud providers, locking into an upfront agreement to pay for a full year rather than paying by the hour. This means sacrificing some of the flexibility that makes the cloud attractive. A sketch of the underlying cost model follows the table.

| Cost per GPU-hour | Low-end performance, 10% utilization | Low-end performance, 100% utilization | High-end performance, 10% utilization | High-end performance, 100% utilization |
| --- | --- | --- | --- | --- |
| AWS | $1.84 | $0.93 | $3.07 | $1.94 |
| GCP | $1.32 | $0.84 | $2.90 | $1.95 |
| Pre-built DL server | $1.56 | $0.16 | $3.42 | $0.34 |
| Built-from-scratch DL server | $1.17 | $0.12 | $2.57 | $0.26 |
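
For readers who want to adapt this comparison to their own numbers, here is a minimal sketch of the cost model the table implies. Every input in the example (the server price, lifetime, and the 1.5x overhead multiplier for power, cooling, and maintenance) is a hypothetical placeholder, not one of the inputs behind our published figures.

```python
# Effective cost per *utilized* GPU-hour, cloud vs. on-premise.
# All inputs are illustrative assumptions.
HOURS_PER_YEAR = 24 * 365

def cloud_cost_per_gpu_hour(hourly_rate, utilization, reserved=False):
    """On-demand instances bill only while running, so the effective rate
    is flat; a reserved instance is paid for all year, so idle time
    inflates the cost per utilized hour."""
    return hourly_rate / utilization if reserved else hourly_rate

def onprem_cost_per_gpu_hour(server_price, lifetime_years, utilization,
                             overhead=1.5):
    """Purchase price, with an overhead multiplier standing in for power,
    cooling, and maintenance, amortized over utilized hours."""
    useful_hours = lifetime_years * HOURS_PER_YEAR * utilization
    return server_price * overhead / useful_hours

# Hypothetical $9,000 single-GPU server with a 4-year life:
print(f"${onprem_cost_per_gpu_hour(9000, 4, 1.0):.2f}/hr")  # ~$0.39 at 100% utilization
print(f"${onprem_cost_per_gpu_hour(9000, 4, 0.1):.2f}/hr")  # ~$3.85 at 10% utilization
```

The shape of the table falls out of this model: an idle on-demand instance can simply be shut off, so low utilization barely changes cloud economics, while an owned server depreciates whether or not it is busy. That is why on-premise wins decisively only at high utilization.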

Simply put, what we found was that the major cloud providers, where most universities, businesses, and scientific organizations would likely look to start an AI initiative, were far more expensive than owning and operating your own on-premise infrastructure. Other factors, including security and data gravity, favor on-premise infrastructure as well.

Recently, WIRED reported that OpenAI spent $8 million on cloud computing to create its first Dota-playing bot. For the 2018 successor, OpenAI disclosed that it rented more than 120,000 processors from Google Cloud and played the equivalent of 45,000 years of the popular battle arena game Dota against versions of itself, this time saying only that the effort cost “millions of dollars.” Sam Altman, CEO of OpenAI, noted that he isn’t sure OpenAI will continue to use cloud services, saying he remains open to buying and designing AI hardware. And, as the OpenAI team has noted, the amount of compute needed to achieve state-of-the-art results in AI is doubling every 3.4 months.
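
A 3.4-month doubling time compounds quickly. As a rough sketch, assuming the trend holds:

```python
# Compute growth implied by a 3.4-month doubling time (the rate
# reported by OpenAI for state-of-the-art training runs).
DOUBLING_MONTHS = 3.4

def compute_multiplier(months):
    return 2 ** (months / DOUBLING_MONTHS)

print(f"{compute_multiplier(12):.0f}x after one year")   # ~12x
print(f"{compute_multiplier(24):.0f}x after two years")  # ~133x
```

At that pace, whatever a state-of-the-art training run costs today, budgeting roughly an order of magnitude more for a year from now is the conservative assumption.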

The cloud vs. on-premise decision isn’t just about cost, though; it’s also about access. I mentioned above that the cloud giants are hoarding the technology needed to make AI modeling faster and cheaper. What I mean is that their internal AI tooling is more mature and advanced than the cloud-based AI software they sell, allowing them to train quality models internally faster and more reliably than organizations relying on that cloud-based software can. Can you imagine if the top companies in the world, the ones that can afford to invest billions of dollars in infrastructure and engineering talent, were the only ones that could effectively build and power AI apps? It seems crazy, but this is more or less what is happening in AI today.

A couple of months ago, the New York Times reported that Google’s DeepMind, an AI lab, won a top biochemical research competition, beating scientists who have dedicated their lives to the field. Not only did DeepMind win, but one Harvard professor reported it was “way out ahead” of the field. Think about what this means: scientists who have spent their lives studying and publishing research on biochemistry fell far behind a Deep Learning (DL) model built by engineers in a technology lab. To me, it is clear from this example that the future of scientific research into critical issues like global warming, cancer, and brain disease will be dominated by the combination of human experts and AI/DL systems. So will the next generation of language translators, voice assistants, workplace productivity applications, computer vision models (the kind that enable driverless vehicles), and more.

The problem, again, is that the cost of doing business hinders scientists and companies that don’t have the resources of Amazon, Google, Facebook, Microsoft, or Apple. Today we are in the dark ages of AI infrastructure. Over the next several years, the cloud giants will continue to push AI features and tout the number of customers running AI programs on their infrastructure. But the truth will remain that the economics do not favor running AI at scale in the cloud, and the giants will keep selling AI software, just not the software they themselves use to be most productive. So more work is needed from the open-source community and innovative startups to bring better alternatives for AI/DL development to the Global 2000 and beyond. The golden age of AI awaits.