December 06, 2023
The Determined Python SDK, a part of the broader Determined Python library, is designed to perform tasks such as:
In other words, it has many of the same capabilities as the Determined CLI, with the benefit of allowing you to write all your logic in Python. To see how it works, let’s walk through a script that demonstrates some of the SDK’s features.
If you’d prefer to run the script now instead of reading further, then clone this GitHub repo and follow the instructions in the readme. If you have any questions, feel free to start a discussion on GitHub or join our Slack Community.
Imagine we’re on a team of engineers that works on many different projects. Our current project is to train image classification models on three different MedMNIST datasets. To achieve this goal we’ll want to:
The script:
First, we import the necessary packages and create some global variables to hold the names of our workspace, projects etc.
from typing import Dict, List, Optional
from determined.common.api import errors
from determined.experimental import client
import medmnist
import yaml
WORKSPACE = "SDK Demo" # The workspace that contains the projects
PROJECT = "MedMNIST" # The project that contains experiments
MODEL_DIR = "mednist_model" # Where the config and model_def files live
# We'll train models on these 3 MedMNIST datasets
DATASETS = ["dermamnist", "bloodmnist", "retinamnist"]
DEMO_VERSION = "demoV1"
Next, we use the SDK to create a workspace and project, if they don’t already exist.
Which parts of the code use the Determined Python SDK? The parts with client, like client.create_workspace(workspace_name), and any of its derivatives, like workspace.list_projects().
def setup_projects(workspace_name: str, project_names: List[str]) -> None:
    try:
        workspace = client.get_workspace(workspace_name)
    except errors.NotFoundException:
        print(f"Creating workspace '{workspace_name}'")
        workspace = client.create_workspace(workspace_name)
    workspace_project_names = [project.name for project in workspace.list_projects()]
    for name in project_names:
        if name not in workspace_project_names:
            print(f"Creating project '{name}'")
            workspace.create_project(name)
Here we archive experiments that are in our current project, have the same name as any of the experiments we want to run, and are in a non-running state.
The SDK is used to:
project_id = client.get_workspace(workspace_name).get_project(project_name).id
exps = client.list_experiments(name=name, project_id=project_id)
if exp.state.value in (...)
exp.archive()
Here’s the function:
def archive_experiments(
    experiment_names: List[str], workspace_name: str, project_name: str
) -> None:
    project_id = client.get_workspace(workspace_name).get_project(project_name).id
    for name in experiment_names:
        exps = client.list_experiments(name=name, project_id=project_id)
        for exp in exps:
            if not exp.archived:
                if exp.state.value in (
                    client.ExperimentState.COMPLETED.value,
                    client.ExperimentState.CANCELED.value,
                    client.ExperimentState.DELETED.value,
                    client.ExperimentState.ERROR.value,
                ):
                    print(f"Archiving experiment {exp.id} (dataset={exp.name})")
                    exp.archive()
                else:
                    print(
                        f"Not archiving experiment {exp.id} (dataset={exp.name}) because it is"
                        f" still in state {exp.state}"
                    )
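To make the archiving rule easier to see on its own, here is a minimal sketch that runs without a Determined cluster. State and FakeExperiment are hypothetical stand-ins for client.ExperimentState and client.Experiment, and select_archivable is a helper name we made up for illustration; only the four "finished" states come from the function above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class State(Enum):
    """Hypothetical stand-in for client.ExperimentState (a few members only)."""

    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    COMPLETED = "COMPLETED"
    CANCELED = "CANCELED"
    DELETED = "DELETED"
    ERROR = "ERROR"


# The same four states that archive_experiments treats as safe to archive.
ARCHIVABLE = frozenset(
    {State.COMPLETED, State.CANCELED, State.DELETED, State.ERROR}
)


@dataclass
class FakeExperiment:
    """Hypothetical stand-in for client.Experiment."""

    id: int
    state: State
    archived: bool = False


def select_archivable(exps: Iterable[FakeExperiment]) -> List[int]:
    """Return ids of experiments that are not yet archived and not running."""
    return [e.id for e in exps if not e.archived and e.state in ARCHIVABLE]


if __name__ == "__main__":
    demo = [
        FakeExperiment(1, State.COMPLETED),
        FakeExperiment(2, State.RUNNING),
        FakeExperiment(3, State.ERROR, archived=True),
    ]
    print(select_archivable(demo))
```

Keeping the allowed states in one set means the rule can be adjusted in a single place if your team decides, for example, that errored experiments should stay visible.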
To organize checkpoints, we create models* in the model registry. We move these newly created models to our workspace using
model.move_to_workspace(workspace_name=workspace)
This can help keep things organized and easier to find, especially in a large team setting where many people are creating models for different projects.
*A registry model is like a tag or folder that can contain multiple checkpoints. Each checkpoint within a model can have its own metadata and user notes.
Our function creates a single model, and we’ll call this function in a for-loop to create multiple models.
def create_model(name: str, workspace: str) -> None:
    workspace_id = client.get_workspace(workspace).id
    try:
        model = client.get_model(name)
        print(f"Using existing model '{name}' from the registry")
    except errors.NotFoundException:
        print(f"Creating new model '{name}' in the registry")
        model = client.create_model(name=name)
    if model.workspace_id != workspace_id:
        model.move_to_workspace(workspace_name=workspace)
This next function takes care of experiment configuration, creation, and labeling.
It first loads a base configuration file (config.yaml), which contains common settings that will be used by all of our experiments. Then it modifies the configuration with dataset-specific information, like the number of samples per epoch. Note that another approach is to discard the yaml file, and instead define the entire configuration within the Python code.
def run_experiment(
    dataset: str, workspace: str, project: str, labels: Optional[List[str]]
) -> client.Experiment:
    with open(f"{MODEL_DIR}/config.yaml", "r") as file:
        exp_conf: Dict[str, str] = yaml.safe_load(file)
    # Set configuration particular to this dataset and example script
    exp_conf["name"] = dataset
    exp_conf["workspace"] = workspace
    exp_conf["project"] = project
    exp_conf["records_per_epoch"] = medmnist.INFO[dataset]["n_samples"]["train"]
    exp_conf["hyperparameters"]["data_flag"] = dataset
    print(f"Starting experiment for dataset {dataset}")
    exp = client.create_experiment(config=exp_conf, model_dir=MODEL_DIR)
    print(f"Experiment {dataset} started with id {exp.id}")
    # labels is optional, so fall back to an empty list when it is None
    for label in labels or []:
        exp.add_label(label)
    return exp
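As a rough illustration of the "define the configuration in Python" approach mentioned above, the sketch below builds a per-dataset config from a base dictionary instead of a yaml file. BASE_CONFIG, its entrypoint and batch size values, the build_config helper, and the sample count of 1000 are all placeholders we invented for the example; only the five dataset-specific keys are taken from run_experiment.

```python
import copy
from typing import Any, Dict

# Hypothetical base settings standing in for config.yaml.
# The real file in the demo repo will contain different values.
BASE_CONFIG: Dict[str, Any] = {
    "entrypoint": "model_def:PlaceholderTrial",  # placeholder, not the real entrypoint
    "hyperparameters": {"global_batch_size": 32},  # placeholder value
}


def build_config(
    dataset: str, workspace: str, project: str, n_train: int
) -> Dict[str, Any]:
    """Return a fresh config for one dataset without mutating BASE_CONFIG."""
    # Deep copy so the nested hyperparameters dict is not shared between runs.
    conf = copy.deepcopy(BASE_CONFIG)
    conf["name"] = dataset
    conf["workspace"] = workspace
    conf["project"] = project
    conf["records_per_epoch"] = n_train
    conf["hyperparameters"]["data_flag"] = dataset
    return conf


if __name__ == "__main__":
    # 1000 is a made-up sample count; the script above reads it from medmnist.INFO.
    print(build_config("dermamnist", "SDK Demo", "MedMNIST", n_train=1000))
```

The resulting dictionary could then be passed as the config argument in the same way run_experiment passes the dictionary it loads from yaml. The deep copy matters because we launch several experiments in a loop, and a shared nested dictionary would leak one dataset's data_flag into the next.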
When an experiment finishes*, we will check that it ended in the COMPLETED state, rather than some other state like ERROR, and then fetch its best checkpoint.
*The exp.wait() command returns an exit status when an experiment finishes running.
def finish_experiment(exp: client.Experiment) -> client.Checkpoint:
    exit_status = exp.wait()
    print(f"Experiment {exp.id} completed with status {exit_status}")
    if exit_status == client.ExperimentState.COMPLETED:
        checkpoints = exp.list_checkpoints(
            max_results=1,
            sort_by=client.CheckpointSortBy.SEARCHER_METRIC,
            order_by=client.OrderBy.DESCENDING,
        )
        return checkpoints[0]
    else:
        raise RuntimeError(
            f"Experiment {exp.name} (id={exp.id}) did not complete successfully."
            f" It is currently in state {exp.state}"
        )
Finally, we have the main function that calls all of the above functions, passing in the names of our workspace, project, datasets, and experiment label.
In the for-loop at the very end of our main function, we add the experiments’ best checkpoints to the model registry. We also add a note to each checkpoint using the set_notes function, so that anyone who views the checkpoint can understand the context.
def main():
    client.login()  # Host address & user credentials can be optionally passed here
    setup_projects(
        workspace_name=WORKSPACE,
        project_names=[PROJECT],
    )
    archive_experiments(
        experiment_names=DATASETS,
        workspace_name=WORKSPACE,
        project_name=PROJECT,
    )
    exps = []
    for dataset in DATASETS:
        create_model(name=dataset, workspace=WORKSPACE)
        exps.append(
            run_experiment(dataset, workspace=WORKSPACE, project=PROJECT, labels=[DEMO_VERSION])
        )  # Run the experiments in parallel
    print("Waiting for experiments to complete...")
    for exp in exps:
        best_checkpoint = finish_experiment(exp)
        # models and experiments are both named after their medmnist dataset
        model = client.get_model(exp.name)
        model_version = model.register_version(best_checkpoint.uuid)
        model_version.set_notes(f"Creating using Determined SDK demo version {DEMO_VERSION}")


if __name__ == "__main__":
    main()
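The comment on client.login() notes that the host address and credentials can be passed explicitly. One possible way to do that is sketched below: a small helper collects the values from environment variables and forwards only the ones that are set. DET_MASTER is the variable used later in this post; DEMO_USER and DEMO_PASSWORD are hypothetical names chosen for this example, and login_kwargs and login_from_env are our own helper names, not part of the SDK.

```python
import os
from typing import Dict, Mapping, Optional


def login_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build keyword arguments for client.login() from environment variables.

    Unset or empty variables are skipped, so client.login() falls back to
    its own defaults for them.
    """
    env = os.environ if env is None else env
    # Maps client.login() keyword -> environment variable name.
    # DEMO_USER and DEMO_PASSWORD are hypothetical names for this sketch.
    sources = {
        "master": "DET_MASTER",
        "user": "DEMO_USER",
        "password": "DEMO_PASSWORD",
    }
    return {kw: env[var] for kw, var in sources.items() if env.get(var)}


def login_from_env() -> None:
    """Log in using whatever values are available (needs a running cluster)."""
    # Imported lazily so login_kwargs can be tried without the SDK installed.
    from determined.experimental import client

    client.login(**login_kwargs())


if __name__ == "__main__":
    # Show which argument names would be passed, without printing secrets.
    print(sorted(login_kwargs()))
```

In the demo script, login_from_env() would take the place of the bare client.login() call at the top of main.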
To see the script in action, you’ll need additional files like the model training code and yaml config. Here’s how to get it running:
git clone https://github.com/determined-ai/determined-examples/
cd determined-examples/blog/python_sdk_demo
pip install -r requirements.txt
Set the DET_MASTER environment variable in your terminal. For example, if you’re running this locally:
export DET_MASTER=localhost:8080
python determined_sdk_demo.py
As the script runs, you should see the following print to your terminal:
Creating workspace 'SDK Demo'
Creating project 'MedMNIST'
Creating new model 'dermamnist' in the registry
Starting experiment for dataset dermamnist
Preparing files to send to master... 6.2KB and 5 files
Experiment dermamnist started with id 1
Creating new model 'bloodmnist' in the registry
Starting experiment for dataset bloodmnist
Preparing files to send to master... 6.2KB and 5 files
Experiment bloodmnist started with id 2
Creating new model 'retinamnist' in the registry
Starting experiment for dataset retinamnist
Preparing files to send to master... 6.2KB and 5 files
Experiment retinamnist started with id 3
Waiting for experiments to complete...
Waiting for Experiment 1 to complete. Elapsed 1.0 minutes
Waiting for Experiment 1 to complete. Elapsed 2.0 minutes
Waiting for Experiment 1 to complete. Elapsed 3.0 minutes
Waiting for Experiment 1 to complete. Elapsed 4.0 minutes
Waiting for Experiment 1 to complete. Elapsed 5.0 minutes
Experiment 1 completed with status ExperimentState.COMPLETED
Experiment 2 completed with status ExperimentState.COMPLETED
Experiment 3 completed with status ExperimentState.COMPLETED
In your WebUI, you should see something like the following:
The SDK Demo workspace:
The MedMNIST project in the SDK Demo workspace:
Three experiments in the MedMNIST project:
Three models in the model registry:
A checkpoint listed within each registry model:
The text Creating using Determined SDK demo version demoV1 in the Notes area of each checkpoint.
In this blog post, we gave a walkthrough of a script that uses the Determined Python SDK. For more information about the SDK, please see the documentation.
If you want to learn about the other ways you can use Determined, check out our blog post that gives a high-level summary of all the Determined APIs.
If you have any questions, feel free to start a discussion in our GitHub repo and join our Slack Community!