Faster NLP with Deep Learning: Distributed Training

In this 2-part blog series, we demonstrate how to accelerate NLP deep learning model training across multiple GPUs, taking BERT for SQuAD1 model training time down from nearly 7 hours to only 30 minutes!

In this post, we motivate the need for deep learning and distributed training in NLP, highlight the factors that matter most for effective distributed training, and provide empirical results from a series of BERT fine-tuning experiments run in Determined. In part 2, we’ll outline techniques for optimizing distributed data loading and model evaluation for the same NLP task.

The Seeds of Deep Learning for NLP

Neural language models like BERT, XLNet, and GPT-3 picked up headline-grabbing steam over the last couple of years. Why is this?

The machine learning algorithms for common NLP tasks like entity recognition and question answering typically require feature vectors as input. Natural language inputs are represented as vectors, and from this point we train models to classify text, cluster documents, answer questions, and the like.2

A considerable amount of research over the past decade has demonstrated the importance of language modeling as a driver for the ultimate quality of the downstream NLP task models.3 The statistical language models of last century like bag-of-words and tf-idf effectively capped the state of the art well below “earth-shattering” for many NLP tasks, particularly the AI-hard NLP tasks that captivate us. I remember first playing with the Babel Fish machine translation engine in the late 1990s and concluding that Skynet was a long way away — Babel Fish often felt more like a word-by-word dictionary lookup engine than a language understanding and translation AI.

The aforementioned cap has since lifted (arguably disappeared) for many NLP tasks, and we constantly experience the benefits firsthand when we search the web or talk to Alexa. NLP advancements exploded in 2018 as a new class of language models arrived: ELMo based on LSTMs, followed by BERT based on the transformer architecture.4 The New Idea of these language models is in their dependency on word context; whereas previous non-contextual word embeddings like word2vec and GloVe encode a given word into exactly one vector, the Sesame Street models are designed to capture language nuance in ways that non-contextual word embeddings can’t by definition. (Can you spot an example in this sentence that would be problematic for word2vec, or are you seeing spots?)

Neural Language Model Training Cost

State-of-the-art language models can be astronomically expensive to train. Models like BERT and XLNet learn 100s of millions of parameters — the compute cost to train many neural language models from scratch is prohibitive for most organizations. For example, XLNet required training for 5.5 days on 512 TPU v3’s — it would cost more than $500,000 to replicate this on GCP!5 Language model complexity and training cost is only increasing — OpenAI’s GPT-3 model, introduced in May 2020, learns 175 billion parameters and is estimated to have cost over $12M to train.

Transfer Learning in NLP

An example sequential transfer learning workflow for NLP. Often costly pretraining produces a generally useful language model that is adapted to solve a specific downstream NLP task. Source for model images: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Thanks to the research community and the organizations backing language model research, high-quality pretrained models are freely available for anyone to use.6 Not only are model checkpoints available, but the broader ML library ecosystem makes it straightforward to use these pretrained models in your framework of choice with just a few lines of code. As a result, the aforementioned sequential transfer learning approach to NLP tasks is easy to implement on a reasonable budget. In our case, we trained BERT for SQuAD to produce a model that can extract answers to questions about a given passage. Each experiment required under $20 of compute cost to train on GCP, and we achieve high quality F1 scores in the 87-89% range. For comparison’s sake, human performance is 86.8%, and state of the art is 93%.7

Baseline: Single GPU Training

Let’s start by training a BERT model for SQuAD on a single GPU. In this PyTorch example, we use Hugging Face’s Transformers library’s pretrained model, featurization and data loading methods, and reference hyperparameters. Training for about 2 epochs on a single GPU yields an F1 score of 88.4% on the validation dataset. Not bad at all… but it took well over 7 hours to train! That’s a long time to wait for results, and this is only one well-trodden benchmark NLP task with fixed hyperparameters. If you or your team need to train on more data, tune hyperparameters, train models for other NLP tasks, or develop and debug models for custom downstream tasks, spending a day or longer to train each model would grind progress to a crawl.

Applying Distributed Training

In the remainder of this post, we show how to use Determined’s distributed training features to perform the same training task about 14 times faster, while achieving the same model accuracy! In the process, we’ll surface the key factors affecting scaling performance and offer practical guidance for effective distributed training.

Thanks to Determined’s standardized model paradigm, training a model on multiple GPUs requires no code changes. Every result that follows, whether training with 2 GPUs on a single machine or 16 GPUs across multiple machines, only required a configuration change.

Let’s see how we can train the same model above in a lunch break rather than a full workday, by utilizing distributed training on Determined. In this blog, we focus on optimizing the training loop. In part 2, we will focus on other components of the training workflow that impact overall runtime, namely data loading and validation. To that end, the following wall clock times are net of data loading and validation.8

Initial Scaling Results

The following graph and table show wall clock training time as we scale from 1 to 8 NVIDIA K80 GPUs,9 all on a single machine on GCP. For these experiments, we fixed both the per-GPU batch size and the total number of epochs to train.

Training Time vs. GPUs

# GPUs Training Time (minutes) Per-GPU Scaling Efficiency
1 399 1.00
2 214 0.93
4 118 0.85
8 61 0.82

This is a decent initial result — we aren’t achieving linear scaling, but our scaling efficiency remains above 80%, and also we managed to get wall clock training time for the BERT SQuAD model down to an hour. We achieved an F1 score of 87-88% on the validation dataset for all four experiments.

Doing Better: Gradient Aggregation and RAdam

In our results above, there is a clear downward trend in scaling efficiency as we increase the degree of training parallelism. What’s going on there? Maybe we can live with the results at this scale, but what if we had a much heavier-weight training workload that we wanted to distribute across 64 GPUs? The trend is not playing in our favor.

A natural theory for what might be happening is that training is communication-bound. To distribute training over 8 GPUs, Determined divides the training dataset into 8 shards, independently trains 8 models (one per GPU) for a minibatch, and then aggregates and communicates gradient updates so that all 8 copies of the model have the same weights. Then we repeat this process of forward & backward pass followed by gradient aggregation, for every batch.10 As we discuss in our blog here and in our product documentation here, communication-reducing techniques like large batch training can help to reduce wall clock training time. Such techniques are particularly promising when GPUs are connected via PCI-E as opposed to higher-bandwidth (yet more expensive) NVLink.11

Let’s see if this theory plays out for our model. With Determined, you can adjust your per-slot batch size with a configuration change, though bear in mind that even small increases can quickly exhaust GPU memory because the intermediate state needed for the forward and backward pass scale with the size of the model and are linearly dependent on the batch size. Instead, we kept GPU memory requirements moderate in order to run on lower cost NVIDIA K80s. We accomplished this by leveraging a Determined distributed training optimization that reduces both inter-GPU communication and per-GPU memory requirements. This aggregation frequency configuration option specifies the number of forward and backward passes to perform before communicating gradient updates across GPUs. In the following table, we see that increasing our aggregation frequency definitely helps runtime:

Aggregation Frequency Wall Clock Time Savings (minutes) Validation F1
2 14 87.3
10 24 85
100 31 52.1

That’s great — if we can cut 31 minutes off of training time, our per-GPU scaling efficiency goes from 0.93 to a perfect 1. I left something out though — for the 3 rows above, our validation dataset F1 scores are 87.3%, 85%, and (gulp) 52.1%, respectively.

So close, and yet so far. What is happening here? Large batch stochastic gradient descent doesn’t always “just work”; in particular, it’s difficult to find the learning rate sweet spot in this setting. Sometimes, the Linear Scaling Rule works, where if we multiple the batch size by k, we also multiply the (previously tuned) learning rate by k. In our case, using the AdamW optimizer, linear scaling did not help at all; in fact, our F1 scores were even worse when applying the Linear Scaling Rule. Learning rate warmup strategies can also help, where smaller learning rates are used early in large-batch training. In our experience, though, the RAdam optimizer is a better bet that performs robustly for large batch training without the need for extensive learning rate tuning. When we swap in this RAdam implementation, our results look much better:

Aggregation Frequency Wall Clock Time Savings (minutes) Validation F1
10 19 87.7
100 21 83.8

In this case, we can see that even a fairly aggressive aggregation frequency of 10 maintains validation performance and gives us perfect training scaling in going from 1 to 2 GPUs. The result for an aggregation frequency of 100 is included to show that, even though wall clock time speedup has reached the point of diminishing returns, RAdam is robust enough to perform reasonably well even in extreme circumstances. Realistically, we’d recommend testing aggregation frequencies under 16 to improve wall clock training time while not sacrificing model performance.

Supercharged Multi-Machine Training

We have come a long way, employing Determined’s distributed training capability to reduce BERT for SQuAD training time from nearly 7 hours on a single GPU to under an hour on 8 GPUs attached to a single machine.

What can we do to further reduce training time, especially given the fact that 8 GPUs is the max allowed for a single instance on GCP? With Determined, model developers can scale out to multi-machine training just as easily as multi-GPU training on a single machine. This is particularly valuable when baseline single GPU training time takes days to weeks.

In the case of training BERT for SQuAD, when scaling out training to 16 GPUs across two instances, we reduced training time even further to 30 minutes. Compared to our first single-GPU training job, we achieve per-GPU scaling efficiency of 0.83.

It’s worth noting that, as we scaled out to multiple machines, we became even more communication-bound, to the point that training time on 16 GPUs would actually be greater than the training time on 8 GPUs on a single machine if we were to hold the effective batch size constant. By increasing the aggregation frequency to just 2, and using the RAdam optimizer, we scale much better and still achieve an F1 score of 87 on the validation dataset.

What’s Next?

In this post, we used Determined’s support for distributed training to optimize BERT for SQuAD training time from nearly 7 hours to just 30 minutes. Other than switching to the RAdam optimizer, we achieved these performance gains without changing a line of code in the model itself!

While we focused on optimizing the training loop in this post, in the next post we will go outside of the training loop and further optimize model training for the distributed setting. We’ll leverage some coming-soon product enhancements to make this happen, so stay tuned for the next installment.

Hopefully, this post has shown that distributed training with Determined on a real-world NLP transfer learning use case is both easy and highly valuable. If you’re ready for your training jobs to complete over lunch rather than overnight, give it a try and give me a shout to let me know how it goes. Happy training!


  1. The Stanford Question Answering Dataset (SQuAD) is a popular benchmark dataset for the reading comprehension task of extracting question answers from a given passage. 

  2. This is the sequential transfer learning (STL) paradigm, where a (usually costly) pretraining task is completed once, and the resulting pretrained model is the starting point from which we train models for downstream adaptation tasks. This blog offers a great explanation of STL and other flavors of transfer learning in NLP. 

  3. A Survey of the State-of-the-Art Language Models up to Early 2020 

  4. Other Sesame Street characters have since joined the NLP party, with Big Bird most recently being introduced with a specialization in long word sequences. 

  5. 132 hours × $8 per TPU hour (the cost at the time of writing) × 512 

  6. E.g., XLNet models are available here

  7. The current state of the art uses a different model: SA-Net on Albert (ensemble). 

  8. Specifically, in order to report on model effectiveness, we only run a single validation pass at the end of training. Training time reported in this blog is the duration observed from the time we train our first batch to the time that we begin the validation pass. 

  9. The GPUs were connected via PCI-E. GCP offers NVIDIA Tesla V100s connected via NVLink, though for more than 5x the cost per GPU at the time of writing this blog. 

  10. This is a slight simplification. Determined’s distributed training implementation performs wait-free backpropagation, meaning that gradient updates are communicated layer by layer. This enables communication to happen in parallel with the backwards pass. 

  11. This comparison reports NVLink bandwidth ~3x greater than PCI-E.