YogaDL: a better approach to data loading for deep learning models

At Determined AI, we enable deep learning engineers to train better models more quickly and to focus on data science rather than managing infrastructure. One of the major pain points that we have observed in training models is the process of loading data. In a previous blog post, we described how tf.data.Dataset’s focus on sequential rather than random access leads to challenges supporting common deep learning tasks such as shuffling data, sharding data for distributed training, and efficiently restoring workloads after failures.

Today, we are excited to announce that we are open-sourcing YogaDL under the Apache 2.0 license. YogaDL provides a better approach to data loading and API-transparent caching to local storage, AWS S3, and Google Cloud Storage.

A better approach to data loading

YogaDL is designed to be two things: a standalone caching layer that gives existing data loaders the benefits of random access, and a better interface for defining data loaders in general.

YogaDL provides both a random-access layer and a sequential-access layer. As we argued recently, supporting efficient random access is critical for good training infrastructure. Direct random access to any record enables:

  1. Shuffling (potentially every epoch).
  2. Pausing/continuing training mid-epoch.
  3. Sharding the dataset efficiently for distributed training.

The sequential-access layer enables:

  1. Prefetching data loading to hide latency costs.
  2. Parallelizing data loading to hide compute costs.

YogaDL enables random access by caching datasets. A dataset is cached by iterating over it once before training starts and storing the output in an LMDB file. The cache, which can live on a local file system, S3, or GCS, enables random access, dataset versioning, and efficient data access. Once the dataset is cached, YogaDL provides a random-access layer followed by a sequential-access layer. It does this by introducing the yogadl.DataRef interface, which creates an explicit boundary between the random- and sequential-access layers.
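
To make that boundary concrete, here is a rough sketch of the shape of the interface (illustrative only; the actual yogadl signatures may differ slightly): a DataRef represents the cached, randomly addressable dataset, and calling stream() on it hands records off to the sequential-access layer.

# Illustrative sketch of the DataRef boundary; not the exact yogadl
# signatures. The DataRef is the random-access layer, and stream()
# produces the sequential-access layer.
class DataRef:
    def __len__(self) -> int:
        # Number of records in the cached dataset.
        ...

    def stream(self, start_offset=0, shuffle=False, shuffle_seed=None,
               shard_rank=0, num_shards=1):
        # Return a sequential view over (a shuffled shard of) the
        # records, starting at start_offset.
        ...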

Currently, YogaDL accepts a tf.data.Dataset as input and returns a yogadl.Stream, which can be read either as a tf.data.Dataset or as a Python generator. Support for additional data frameworks, such as tf.keras Sequences and PyTorch DataLoaders, is on our near-term roadmap.
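
For code outside TensorFlow's input pipeline, the stream can also be consumed as a plain Python generator. A minimal sketch, assuming a stream built as in the example below and assuming the Stream exposes its records through an iterator_fn callable:

# Read a yogadl.Stream as a plain Python generator instead of a
# tf.data.Dataset. `stream` is assumed to be built as in the example
# below; iterator_fn is an assumption about the Stream interface.
for record in stream.iterator_fn():
    print(record)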

YogaDL in action

import yogadl
import tensorflow as tf

# Initialize YogaDL Storage.
config = yogadl.storage.LFSConfigurations(
    storage_dir_path="/tmp/yogadl_cache"
)
storage = yogadl.storage.LFSStorage(config)

# Cache the dataset.
@storage.cacheable(dataset_id="example-dataset", dataset_version="1")
def make_records():
    # Dataset that gets cached.
    return tf.data.Dataset.range(10)

# Create the random-access layer.
dataref = make_records()

# Create the sequential-access layer.
stream = dataref.stream(
    start_offset=3,
    shuffle=True,
    shuffle_seed=23,
    shard_rank=2,
    num_shards=4,
)

# Convert to a tf.data.Dataset.
tf_dataset = yogadl.tensorflow.make_tf_dataset(stream)

# Read the dataset.
batches = tf_dataset.repeat(3).batch(5)
for batch in batches:
    print(batch)

This code snippet shows how YogaDL can be used with tf.data.Dataset. Compared to using just a tf.data.Dataset, YogaDL enables users to:

  1. Efficiently shuffle. There’s no need to keep an in-memory shuffle buffer the way tf.data.Dataset requires.
  2. Efficiently shard the dataset for distributed training. With tf.data.Dataset, sharding the dataset across workers often requires every worker to iterate over the entire dataset. With YogaDL, each worker reads only the data in its shard, which is far more efficient.
  3. Efficiently start training from anywhere in the dataset. There is no need to iterate through the skipped records; training restarts exactly where it stopped. This is especially important for large datasets.
  4. Cache to the local file system, S3, or GCS. Switching caching locations takes only a few lines of code: swap LFSStorage for S3Storage or GCSStorage (see the sketch after this list).
  5. Version datasets. To switch between datasets or versions of a dataset, just update the dataset ID or version.
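
For items 4 and 5, switching the cache location and bumping the dataset version looks roughly like this. The exact S3Configurations parameter names are an assumption here (check the documentation for the precise signature), and the bucket name is a placeholder.

# Rough sketch of moving the cache from the local file system to S3.
# The S3Configurations parameter names are assumptions; "my-bucket" is
# a placeholder.
config = yogadl.storage.S3Configurations(
    bucket="my-bucket",
    bucket_directory_path="yogadl_cache",
    local_cache_dir="/tmp/yogadl_cache",
)
storage = yogadl.storage.S3Storage(config)

# Bumping dataset_version caches and reads the new version of the data.
@storage.cacheable(dataset_id="example-dataset", dataset_version="2")
def make_records():
    return tf.data.Dataset.range(20)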

What YogaDL is not

YogaDL is not a data manipulation API: the world has more than enough of those. Instead, YogaDL seeks to be API-transparent, so that you can continue to use your existing data loading code but with all the benefits of a high-performance, random-access cache. If you have data augmentation steps which cannot be cached, that code should continue to work without any modifications.
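
For example, because yogadl.tensorflow.make_tf_dataset returns an ordinary tf.data.Dataset, existing tf.data-style augmentation and prefetching apply unchanged. In the sketch below, augment is a hypothetical stand-in for your own augmentation code, and stream is the YogaDL stream from the earlier example.

# Existing tf.data transformations keep working on the dataset that
# YogaDL produces; `augment` is a hypothetical placeholder for your own
# (uncacheable) augmentation step.
def augment(record):
    return record * 2  # stand-in for real augmentation logic

tf_dataset = yogadl.tensorflow.make_tf_dataset(stream)
tf_dataset = (
    tf_dataset.map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.experimental.AUTOTUNE)
)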

YogaDL offers basic dataset versioning, but it is not currently a full-blown version control system for datasets. Something closer to full version control for datasets is on the roadmap as well.

Getting started with YogaDL

YogaDL can be installed via pip install yogadl. To get started, check out the GitHub repository and the documentation, and join our community Slack.

YogaDL with Determined

In addition to offering YogaDL as a standalone library, we have integrated it into our open-source training platform, Determined. To get started with YogaDL in Determined, take a look at our documentation and examples.