ALBERT on Determined: Distributed Training with Spot Instances

We hosted a lunch-and-learn session on April 13 where we demonstrated the fundamentals of distributed training using ALBERT on Determined. In the coming weeks, we’ll be hosting another interactive session (and providing lunch for attendees!), so be sure to join our Determined AI meetup group!

One of the big headaches in deep learning is that models take forever to train. As an ML engineer, waiting hours or days for training to complete makes iteratively improving your model a slow and frustrating process. You can speed up model training by using more GPUs, but this raises two challenges:

  1. Distributed training is a hassle because it requires changing your model code and dealing with DevOps headaches like server management, cluster scheduling, networking, etc.
  2. Using many GPUs at once can quickly cause your training costs to skyrocket, especially when using on-demand cloud GPUs.

In this blog post, we show how to accelerate fine-tuning the ALBERT language model [1] while also reducing costs by using Determined’s built-in support for distributed training with AWS spot instances. As a baseline, fine-tuning ALBERT took over 36 hours on a single V100 GPU and cost $112 on AWS. With distributed training and spot instances, training the model on 64 V100 GPUs took only 48 minutes and cost only $47. That’s both a 46x performance improvement and a 58% reduction in cost!

Best of all, realizing these performance gains and cost reductions required nothing more than changing a few configuration settings. As we detail below, switching to distributed training and leveraging spot instances in Determined can be done without changing your model code, without needing to understand the details of using spot instances, and with no manual server wrangling required.

ALBERT

For this benchmark, we fine-tuned the most recent, high-accuracy version of ALBERT (albert-xxlarge-v2) on the SQuAD 2.0 dataset, using the Hugging Face implementation. We chose a fine-tuning task rather than training from scratch because it takes less time and because training from scratch is often too expensive to be practical. However, the techniques here would be equally applicable to training from scratch.
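
For reference, here is a minimal sketch of how that model is loaded with Hugging Face Transformers (dataset preparation and the training loop are omitted, and this is not the exact script we ran):

    from transformers import AlbertForQuestionAnswering, AlbertTokenizerFast

    # Load the pretrained albert-xxlarge-v2 checkpoint with a question-answering head for SQuAD 2.0.
    tokenizer = AlbertTokenizerFast.from_pretrained("albert-xxlarge-v2")
    model = AlbertForQuestionAnswering.from_pretrained("albert-xxlarge-v2")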

For training, we chose V100 GPUs with 16 GB of memory: they are fast, yet not quite the most powerful GPUs available on AWS, which means they are reliably available on the AWS spot market [2].

Speeding up ALBERT

Baseline Performance

As the baseline for this benchmark, we used Hugging Face’s SQuAD 2.0 example code to fine-tune ALBERT on a single V100 GPU for 2 epochs. On a single V100, the throughput of ALBERT training is 2 examples per second, and it took 36.7 hours to train for 2 epochs.

That’s a long time, so let’s see how we can speed up training by using more GPUs.

Distributed Training And Batch Sizes

Before talking about what we did with ALBERT, let’s briefly go over our general approach to speeding up training with more GPUs.

The biggest challenge with distributed training [3] is that as you increase the batch size, accuracy can start to decline. There are steps you can follow to scale out without losing accuracy, but how far you can scale before accuracy suffers varies by model.

The general rule of thumb when increasing the number of GPUs is to keep the per-GPU batch size fixed and increase the global batch size (sometimes called weak scaling). This helps to ensure that each individual GPU is as busy as possible.

As your global batch size increases, you will need to adjust your learning rate to maintain the same accuracy. The recommended technique for this is very simple: increase your learning rate proportionally with your batch size (e.g., if you double the global batch size, double the learning rate). This is known as the Linear Scaling Rule, introduced in this Facebook paper. As the global batch size gets very large, the Linear Scaling Rule breaks down, but it often gives you a learning rate that is just slightly higher than what you need.
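
As a rough sketch of how weak scaling and the Linear Scaling Rule combine (the base values below are illustrative, not the exact hyperparameters from our runs):

    def scale_hyperparameters(num_gpus, per_gpu_batch=4, base_lr=3e-5):
        """Weak scaling plus the Linear Scaling Rule, relative to a single-GPU baseline."""
        global_batch = per_gpu_batch * num_gpus  # per-GPU batch stays fixed, so the global batch grows
        lr = base_lr * num_gpus                  # Linear Scaling Rule: the LR grows with the global batch
        return global_batch, lr

    print(scale_hyperparameters(num_gpus=8))     # roughly (32, 2.4e-4)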

Eventually, your batch size grows enough that you hit the “large batch size convergence problem”, where the model won’t converge no matter what learning rate you use. At that point, you need to start using large-batch-size-specific techniques to converge.

If you hit the large-batch convergence problem and want to continue speeding up training, the easiest way to incorporate more GPUs is to hold the global batch size constant and start decreasing the per-GPU batch size as you add more GPUs. This may reduce how busy your GPUs are, but that loss in efficiency is often offset by the larger amount of compute.
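
The arithmetic in that regime is simple; here is a minimal sketch using numbers that match the 64-GPU, 256-example setup described in the next section:

    def per_gpu_batch(global_batch=256, num_gpus=64):
        """Hold the global batch fixed and shrink the per-GPU batch as GPUs are added."""
        assert global_batch % num_gpus == 0, "the global batch must divide evenly across GPUs"
        return global_batch // num_gpus

    print(per_gpu_batch())  # 4 examples per GPU per step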

RAdam

One very powerful and easy-to-use technique is the RAdam optimizer. RAdam reliably maintains high accuracy while being less sensitive to the exact learning rate you choose. We highly recommend trying it if you are doing distributed training.
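
Here is a minimal sketch of what that swap looks like in PyTorch (torch.optim.RAdam ships with PyTorch 1.10 and later; the tiny model and learning rate below are placeholders, not our benchmark settings):

    import torch
    import torch.nn as nn

    model = nn.Linear(128, 2)  # tiny stand-in for the real ALBERT model
    # Drop-in replacement: swap torch.optim.Adam for torch.optim.RAdam; the rest of the loop is unchanged.
    optimizer = torch.optim.RAdam(model.parameters(), lr=3e-5)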

ALBERT In 48 Minutes

In this benchmark, we were able to scale up to 64 GPUs. We swapped out Adam for RAdam and ran a number of experiments where we increased the global batch size while using the Linear Scaling Rule. We found that the model did not converge when using a global batch size larger than 256.

We decided to focus on training on 64 GPUs with a global batch size of 256. Our initial experiment used the Linear Scaling Rule to set the learning rate, but that gave us lower accuracy than we wanted; slightly reducing the learning rate recovered the accuracy we were after.
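
Concretely, scaling out in Determined is an experiment-configuration change rather than a code change. Here is a rough sketch of the relevant fragment (field names follow Determined’s experiment configuration; the learning rate is an illustrative value, not our exact final setting):

    resources:
      slots_per_trial: 64      # one slot per GPU; Determined handles the distributed training setup
    hyperparameters:
      global_batch_size: 256   # the per-GPU batch size is global_batch_size / slots_per_trial
      learning_rate: 3e-5      # illustrative; we ended up slightly below the Linear Scaling Rule value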

| GPUs | Throughput (examples/s) | Exact Match | F1    | Time to Train (2 epochs) |
|------|-------------------------|-------------|-------|--------------------------|
| 1    | 2                       | 85.76       | 88.87 | 36.7 hours               |
| 8    | 15.8                    | 85.76       | 88.87 | 4.7 hours                |
| 64   | 92.75                   | 86.24       | 89.06 | 48 minutes               |

When we train ALBERT on 64 V100 GPUs, the throughput is 92.75 examples per second, 46 times faster than the throughput on a single GPU. Training for two epochs takes 48 minutes, as opposed to 36.7 hours with a single V100 GPU. Being able to train your model multiple times per day instead of a couple of times per week made life much easier when doing this benchmarking.

Saving Money With Spot Instances

You’ll notice that we are using 64 times as many GPUs but training is only 46 times faster. Normally that would make using 64 GPUs more expensive than using a single GPU, introducing a cost/speed tradeoff to scaling with more GPUs. However, Determined makes it easy to leverage AWS spot instances [4] to also bring the cost of training down dramatically.

Because spot instances are unreliable, you need to build fault-tolerant infrastructure to use them effectively. This means writing code to create checkpoints, code to save and restore RNG state, code to resume your data pipeline at the same place in the dataset, and so on. At first glance it doesn’t seem all that challenging, but once you dig into the details, there is a surprising amount to consider. Our CTO, Neil Conway, wrote a Twitter thread with more detail on the topic.
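
To give a flavor of what that entails, here is a rough sketch of the kind of state such a checkpoint has to capture (illustrative only, not Determined’s actual implementation):

    import random
    import numpy as np
    import torch

    def save_checkpoint(path, model, optimizer, batch_index):
        """Capture everything needed to resume training after a spot interruption."""
        torch.save({
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_index": batch_index,          # so the data pipeline can resume where it left off
            "torch_rng": torch.get_rng_state(),
            "numpy_rng": np.random.get_state(),
            "python_rng": random.getstate(),
        }, path)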

These engineering obstacles often prevent deep learning teams from using spot instances, but with Determined it is trivial to take advantage of spot instances because Determined provides fault-tolerance for all deep learning training workloads automatically.

Configuring Spot Instances with Determined

To use spot instances with Determined, we can take advantage of the support for resource pools that was introduced in Determined 0.14. A resource pool is a collection of identical computational resources. For example, you might configure one pool with expensive GPU instances, a second pool with cheaper GPU instances, and a third pool with CPU instances. Users can select the pool to use when they submit a job.

We can use resource pools to simplify training with spot instances. For example, here is a fragment of a master configuration file that configures a resource pool with up to 64 p3.16xlarge instances:

   resource_pools:
     - pool_name: aws-spot-p3-16xlarge
       provider:
         type: aws
         instance_type: p3.16xlarge
         spot: true
         max_instances: 64
         # ...

When launching a task, you can configure which resource pool the task should be assigned to via the resources.resource_pool configuration variable. For example, to launch an experiment in the aws-spot-p3-16xlarge resource pool defined above, include this fragment in your experiment config file:

resources:
  resource_pool: aws-spot-p3-16xlarge

Spot Instance Availability

Because spot instances use “excess” EC2 capacity, some instance types might have limited availability as spot instances in certain regions. This can be particularly true of instance types equipped with the latest generation of NVIDIA GPUs. For this benchmark, we used p3.16xlarge instances rather than p3dn.24xlarge instances, partly for this reason: although each GPU on a p3.16xlarge has only 16 GB of GPU memory (compared to 32 GB per GPU on a p3dn.24xlarge), we found that p3.16xlarge instances were far more likely to be available as spot instances. The exact savings you get from spot instances can vary, but we found that p3.16xlarge spot instances in us-east-1 were consistently about 70% cheaper than on-demand instances.

Results

Here are the final results:

| Run              | Training Time (2 epochs) | Cost    |
|------------------|--------------------------|---------|
| 1 GPU, on-demand | 36.7 hours               | $112.21 |
| 64 GPUs, spot    | 48 minutes               | $47     |
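
As a quick sanity check on the 64-GPU figure (assuming the us-east-1 on-demand price of roughly $24.48/hour for a p3.16xlarge, which has 8 V100s, and the ~70% spot discount mentioned above):

    instances = 64 // 8                 # 8 V100s per p3.16xlarge
    hours = 48 / 60                     # 48 minutes of training
    on_demand_cost = instances * hours * 24.48
    spot_cost = on_demand_cost * 0.30   # ~70% cheaper on the spot market
    print(f"on-demand: ${on_demand_cost:.0f}, spot: ${spot_cost:.0f}")  # on-demand: $157, spot: $47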

These results show that with the right software and some cleverness when scaling to large batch sizes, it is possible to get dramatically faster training, high model accuracy, and lower cost, all at the same time! Just as important, moving from single-GPU training to distributed training, and from on-demand to spot instances, doesn’t require changing your workflow, your infrastructure, or your model code. Aside from switching to RAdam, everything we discussed in this blog post can be accomplished by tweaking a few configuration settings when launching your training job with Determined!

To learn more about Determined and how it can help make your training easier, faster, and cheaper, check out our GitHub repo or hop on our community Slack.


  1. A Lite BERT, a near state-of-the-art language model that focuses on reducing the number of parameters and computational cost. 

  2. AWS spot instances are instances that aren’t being used by anyone else in AWS and so can be rented for a greatly reduced price. The downside is that AWS might reclaim the spot instances you are using if someone else is willing to pay the standard (on-demand) price. 

  3. Actually the biggest problem is probably setting up the infrastructure correctly to get distributed training working, but on Determined all of that is taken care of for you. 

  4. GCP supports preemptible instances, which are similar to spot instances on AWS. Determined supports both spot and preemptible instances.