The core steps needed to run distributed training with RaySGD are the following:
Write the code that creates each of the required components (a sketch of these is included after this list):
model_creator
data_creator
optimizer_creator
loss_creator
TrainingOperator
(apply my custom training and validation methods; see the corresponding docs, as well as the API)
Bonus:
Implement use_tqdm
Implement use_fp16
(it actually seems like it should work, even with a custom training method, without further changes; I just need to set it to True)
Create a main script to run the distributed training
main.py
(the full script where I call RaySGD to train and validate the model; a sketch is included after this list)
(recommended) Create a configuration YAML file (an example of how it could be wired in follows after this list)
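To make the first step more concrete, here is a minimal sketch of what the creator functions and a custom TrainingOperator could look like. It assumes the older RaySGD API (ray.util.sgd.torch) and a PyTorch model; the model, data, and hyperparameter names are placeholders that the real architecture, dataset, and custom training logic would replace.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd.torch import TrainingOperator


def model_creator(config):
    # Placeholder model; the real architecture goes here.
    return nn.Sequential(nn.Linear(config.get("n_inputs", 10), 1))


def data_creator(config):
    # Placeholder random data; returns (train_loader, val_loader).
    x = torch.randn(256, config.get("n_inputs", 10))
    y = torch.randn(256, 1)
    dataset = TensorDataset(x, y)
    train_loader = DataLoader(dataset, batch_size=config.get("batch_size", 32))
    val_loader = DataLoader(dataset, batch_size=config.get("batch_size", 32))
    return train_loader, val_loader


def optimizer_creator(model, config):
    return torch.optim.Adam(model.parameters(), lr=config.get("lr", 1e-3))


def loss_creator(config):
    return nn.MSELoss()


class MyTrainingOperator(TrainingOperator):
    # Override the batch-level hooks to apply the custom training and
    # validation logic; self.model, self.optimizer and self.criterion
    # are set up by RaySGD from the creators above.
    def train_batch(self, batch, batch_info):
        features, target = batch
        self.optimizer.zero_grad()
        loss = self.criterion(self.model(features), target)
        loss.backward()
        self.optimizer.step()
        return {"train_loss": loss.item(), "num_samples": features.size(0)}

    def validate_batch(self, batch, batch_info):
        features, target = batch
        with torch.no_grad():
            loss = self.criterion(self.model(features), target)
        return {"val_loss": loss.item(), "num_samples": features.size(0)}
```

All the creators receive the same config dictionary, which is how hyperparameters passed to the trainer (and, later, values from the YAML file) reach each component.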
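A sketch of what main.py could look like, assuming the RaySGD TorchTrainer interface that accepts the creator functions directly (as in older Ray versions); the module name model_components, the worker count, and the epoch count are placeholders, and use_fp16 / use_tqdm are the bonus flags mentioned above.

```python
import ray
from ray.util.sgd import TorchTrainer

# Hypothetical module holding the creators and operator sketched above.
from model_components import (model_creator, data_creator, optimizer_creator,
                              loss_creator, MyTrainingOperator)

ray.init()

trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=loss_creator,
    training_operator_cls=MyTrainingOperator,
    num_workers=2,      # placeholder; number of distributed workers
    use_gpu=False,      # placeholder; set to True on GPU nodes
    use_fp16=True,      # bonus: mixed precision training
    use_tqdm=True,      # bonus: progress bars during training
    config={"lr": 1e-3, "batch_size": 32, "n_inputs": 10},
)

for epoch in range(10):
    train_stats = trainer.train()
    val_stats = trainer.validate()
    print(f"Epoch {epoch}: {train_stats} | {val_stats}")

trainer.shutdown()
ray.shutdown()
```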
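As for the configuration file, one possible way to wire it in is to load it with PyYAML at the top of main.py and split it between the TorchTrainer arguments and its config dictionary; the file name config.yml and its keys are just an assumption.

```python
import yaml

# Hypothetical config.yml contents:
#   num_workers: 2
#   use_gpu: false
#   lr: 0.001
#   batch_size: 32
#   n_inputs: 10

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)

# Trainer-level options are popped out and passed to TorchTrainer directly;
# whatever remains goes through its `config` argument to the creators.
num_workers = config.pop("num_workers", 1)
use_gpu = config.pop("use_gpu", False)
```

Keeping these values in a YAML file makes it easy to change worker counts and hyperparameters between local runs and cluster runs without touching the code.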
If we want to use a cluster on a cloud computing platform, there are some extra steps: