April 20, 2020
ML Engineering is a cross-disciplinary function straddling machine learning and systems: the ML Engineer leverages their data science know-how to develop new models, but they don't simply build one model in a silo, forget about it, and move on to the next use case.
Determined AI and AWS SageMaker are both platforms that accelerate these ML Engineering workflows, but with key differences that the following comparison outlines in detail. The comparison is organized into four groups of features:
Infrastructure: How do the two platforms compare on infrastructure support and management?
| | Determined AI | AWS SageMaker |
|---|---|---|
Infrastructure platform support | Determined runs on-premise and on any cloud provider. | Exclusive AWS managed service offering. |
Resource utilization | Fine-grained resource allocation and fault tolerance features, purpose-built for teams to deliver models more quickly, with fewer resources, and at significantly lower cost than SageMaker. | Each training job utilizes an entire EC2 instance, which can lead to underutilization if the job does not use all of the instance's GPUs.
Autoscaling | Determined’s worker (agent) pool is elastic – it automatically scales up and down to support users’ workloads. | SageMaker also scales underlying resources up and down automatically, though jobs cannot concurrently share GPUs on the same instance. This inefficiency leads to higher infrastructure costs, particularly in settings where jobs run concurrently. |
Infrastructure Cost Effectiveness | Compared to SageMaker, Determined saves on infrastructure cost, especially in collaborative environments, thanks to finer-grained resource utilization, distributed training, and industry-leading hyperparameter search. | SageMaker is priced as an uplift of up to 40% on the underlying EC2 instance cost. This may be acceptable for small-scale experimentation, but it becomes cost-prohibitive at scale (see the back-of-the-envelope sketch below this table).
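To make the cost reasoning above concrete, here is a small back-of-the-envelope sketch in Python. All of the numbers are hypothetical placeholders except the 40% uplift, which comes from the comparison above; the point is how per-instance pricing compounds with GPU underutilization when jobs cannot share an instance.

```python
# Hypothetical back-of-the-envelope cost comparison.
# All numbers are illustrative placeholders, not actual price quotes.

ec2_price_per_hour = 10.0      # assumed hourly price of an 8-GPU instance
sagemaker_uplift = 0.40        # managed-service uplift discussed above
gpus_per_instance = 8
gpus_actually_used = 2         # a job that only needs 2 GPUs
hours = 100

# Whole-instance accounting: each job occupies (and pays for) an entire
# instance at the uplifted rate, even if most GPUs sit idle.
whole_instance_cost = ec2_price_per_hour * (1 + sagemaker_uplift) * hours
whole_instance_per_gpu_hour = whole_instance_cost / (gpus_actually_used * hours)

# Shared-cluster accounting: jobs are packed onto instances at GPU
# granularity, so this job is effectively billed only for the GPUs it uses.
shared_cost = ec2_price_per_hour * (gpus_actually_used / gpus_per_instance) * hours
shared_per_gpu_hour = shared_cost / (gpus_actually_used * hours)

print(f"whole instance + uplift: ${whole_instance_cost:.2f} "
      f"(${whole_instance_per_gpu_hour:.2f} per used GPU-hour)")
print(f"GPU-granular sharing:    ${shared_cost:.2f} "
      f"(${shared_per_gpu_hour:.2f} per used GPU-hour)")
```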
ML Developer Ecosystem: How do the two platforms compare in terms of integration with the tools ML Engineers commonly use?
| | Determined AI | AWS SageMaker |
|---|---|---|
ML Library Support | Determined supports the leading ML libraries: PyTorch, TensorFlow, and Keras. Its trial abstraction layer allows ML Engineers to focus on model development in a fault-tolerant environment that runs jobs across machines automatically. | SageMaker also supports the leading ML libraries, but without the trial abstraction layer offered in Determined.
Distributed Training | Determined provides an abstraction layer for PyTorch, TensorFlow, and TF Keras so that training can run in distributed mode (leveraging GPUs across multiple machines) without the ML Engineer having to write any distributed training code. | Because SageMaker runs training jobs via entry point scripts, it's possible to programmatically integrate a distributed training library like Horovod to parallelize a training job, but the ML Engineer owns that integration code (see the sketch below this table). In Determined, distributed training happens automatically.
Notebooks | Cloud-hosted one-click Jupyter notebooks. The notebooks run in a container on an agent node, improving resource utilization. | Similarly, SageMaker offers cloud-hosted one-click Jupyter notebooks. The notebooks run on a standalone EC2 instance. |
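As a rough illustration of the integration work referenced in the Distributed Training row above, the sketch below shows the kind of Horovod boilerplate a SageMaker entry point script typically carries (and must keep in sync with the model code) to parallelize Keras training across GPUs; with Determined's trial abstraction this plumbing is handled by the platform. The model, data, and checkpoint path here are placeholders.

```python
# Sketch of the Horovod boilerplate an entry point script would need to
# own in order to run data-parallel training; model/data are placeholders.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # start the Horovod/MPI context

# Pin each worker process to a single local GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers each step.
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

callbacks = [
    # Keep all workers' initial weights in sync with rank 0.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# Only rank 0 should write checkpoints, to avoid clobbering.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint("/opt/ml/checkpoints/ckpt.h5"))

# x_train / y_train would come from the job's input channels.
# model.fit(x_train, y_train, callbacks=callbacks, epochs=10)
```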
Model Lifecycle: What tools are available to help the developer build and deploy models more efficiently?
| | Determined AI | AWS SageMaker |
|---|---|---|
Built-in ML algorithms | Determined doesn’t offer algorithms out of the box. Determined’s platform is designed to give ML engineers full flexibility to design and customize their own algorithms. | SageMaker offers built-in common algorithms, though flexibility is limited and therefore many SageMaker users leverage the popular ML libraries directly instead. |
Hyperparameter Search | Determined offers a broad range of hyperparameter search algorithms, including random, grid, PBT, and adaptive. Adaptive is a state-of-the-art approach to hyperparameter search that allocates more resources to more promising configurations while quickly abandoning poor ones (see the successive-halving sketch below this table). | SageMaker hyperparameter search is limited to random and Bayesian, and it requires additional work to adapt code to SageMaker's requirements.
Training Job Resiliency | Training jobs regularly and automatically save checkpoints when using Determined's abstraction layer; the engineer does not need to implement the checkpointing logic in their code. | SageMaker jobs can also run on spot instances, but fault tolerance in that setting comes at a steep implementation and maintenance cost: it's up to the ML Engineer to implement checkpointing in every job they submit so that training resumes, rather than restarts, if a spot instance is reclaimed (see the checkpointing sketch below this table).
Inference | Determined supports model export for easy deployment into your serving environment. | SageMaker offers integrated model deployment hosted in AWS, though at a significantly higher cost per instance (up to 40%), which becomes cost-prohibitive at scale. The hosted models are also particular about input format, so upstream customization is often still required to transform inputs into the expected format.
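To illustrate the idea behind adaptive hyperparameter search referenced above, here is a minimal, self-contained successive-halving sketch in plain Python (not Determined's implementation or API): a batch of hypothetical configurations is evaluated cheaply, the weaker half is dropped at each rung, and the survivors receive progressively more training budget.

```python
# Minimal successive-halving sketch (illustrative only, not Determined's API).
import random

def train_briefly(config, budget):
    """Placeholder: pretend to train `config` for `budget` steps and
    return a validation score (higher is better)."""
    random.seed(hash((config["lr"], budget)) % (2**32))
    return budget * config["lr"] * random.uniform(0.5, 1.0)

# Start with many cheap trials...
configs = [{"lr": random.uniform(1e-4, 1e-1)} for _ in range(16)]
budget = 1
while len(configs) > 1:
    scored = sorted(configs, key=lambda c: train_briefly(c, budget), reverse=True)
    configs = scored[: len(scored) // 2]   # keep the more promising half
    budget *= 2                            # ...and give survivors more budget
print("best configuration:", configs[0])
```

Similarly, the checkpoint-and-resume bookkeeping that spot-instance training on SageMaker leaves to the engineer looks roughly like the following PyTorch sketch; the model, optimizer, and /opt/ml/checkpoints path are placeholders, and Determined's abstraction layer handles the equivalent logic automatically.

```python
# Sketch of the checkpoint/resume logic an engineer would maintain per job
# so that a reclaimed spot instance resumes rather than restarts training.
import os
import torch
import torch.nn as nn

CKPT_DIR = "/opt/ml/checkpoints"          # placeholder checkpoint directory
CKPT_PATH = os.path.join(CKPT_DIR, "latest.pt")

model = nn.Linear(10, 1)                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

start_epoch = 0
if os.path.exists(CKPT_PATH):             # resume if a checkpoint survived
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 10):
    # ... one epoch of training would go here ...
    os.makedirs(CKPT_DIR, exist_ok=True)
    torch.save(
        {"epoch": epoch,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )
```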
Experiment Tracking: How do ML Engineers track experiments, reproduce results, and collaborate with their team?
| | Determined AI | AWS SageMaker |
|---|---|---|
Experiment metadata tracking | Determined's database tracks all experiment metadata over time, including description, labels, experiment configuration (e.g., the hyperparameters and search algorithm used), and trained model weights. Determined also integrates with TensorBoard for deeper analysis. | SageMaker also offers a navigable history of experiment metadata, although the user still needs to write some code against the SageMaker API to search and manage artifacts (see the sketch below this table).
Forking | Many experiments evolve from previous experiments, and Determined’s experiment forking feature allows users to quickly clone and modify experiments. | SageMaker experiments can also be forked. |
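As a rough example of the kind of code referenced in the experiment metadata row above, the sketch below uses boto3's SageMaker Search API to look up training jobs by tag. The tag key and value are placeholder assumptions, and the available filter fields may vary, so treat this as a sketch rather than a drop-in snippet.

```python
# Rough sketch: querying SageMaker training-job metadata via the Search API.
# The tag name/value below are placeholders for illustration.
import boto3

sm = boto3.client("sagemaker")

response = sm.search(
    Resource="TrainingJob",
    SearchExpression={
        "Filters": [
            {"Name": "Tags.project", "Operator": "Equals", "Value": "churn-model"},
        ]
    },
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=10,
)

for result in response["Results"]:
    job = result["TrainingJob"]
    print(job["TrainingJobName"], job["TrainingJobStatus"])
```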