The core steps needed to run distributed training with RaySGD are the following:
Write the code that creates each of the required components (a sketch of these is included after this list):
model_creator
data_creator
optimizer_creator
loss_creator
TrainingOperator
(apply my custom training and validation methods; see the corresponding docs, as well as the API)
Bonus:
Implement use_tqdm
Implement use_fp16
(it actually seems like it should work, even with a custom training method, without further changes; I just need to set it to True)
Create a main script to run the distributed training
main.py
(the full script where I call RaySGD to train and validate the model; a sketch is included after this list)
(recommended) Create a configuration YAML file (an example of how it could be wired in follows after this list)
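To make the first step more concrete, here is a minimal sketch of what the creator functions and a custom TrainingOperator could look like. It assumes the older RaySGD API (ray.util.sgd.torch) and a PyTorch model; the model, data, and hyperparameter names are placeholders that the real architecture, dataset, and custom training logic would replace.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd.torch import TrainingOperator


def model_creator(config):
    # Placeholder model; the real architecture goes here.
    return nn.Sequential(nn.Linear(config.get("n_inputs", 10), 1))


def data_creator(config):
    # Placeholder random data; returns (train_loader, val_loader).
    x = torch.randn(256, config.get("n_inputs", 10))
    y = torch.randn(256, 1)
    dataset = TensorDataset(x, y)
    train_loader = DataLoader(dataset, batch_size=config.get("batch_size", 32))
    val_loader = DataLoader(dataset, batch_size=config.get("batch_size", 32))
    return train_loader, val_loader


def optimizer_creator(model, config):
    return torch.optim.Adam(model.parameters(), lr=config.get("lr", 1e-3))


def loss_creator(config):
    return nn.MSELoss()


class MyTrainingOperator(TrainingOperator):
    # Override the batch-level hooks to apply the custom training and
    # validation logic; self.model, self.optimizer and self.criterion
    # are set up by RaySGD from the creators above.
    def train_batch(self, batch, batch_info):
        features, target = batch
        self.optimizer.zero_grad()
        loss = self.criterion(self.model(features), target)
        loss.backward()
        self.optimizer.step()
        return {"train_loss": loss.item(), "num_samples": features.size(0)}

    def validate_batch(self, batch, batch_info):
        features, target = batch
        with torch.no_grad():
            loss = self.criterion(self.model(features), target)
        return {"val_loss": loss.item(), "num_samples": features.size(0)}
```

All the creators receive the same config dictionary, which is how hyperparameters passed to the trainer (and, later, values from the YAML file) reach each component.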
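A sketch of what main.py could look like, assuming the RaySGD TorchTrainer interface that accepts the creator functions directly (as in older Ray versions); the module name model_components, the worker count, and the epoch count are placeholders, and use_fp16 / use_tqdm are the bonus flags mentioned above.

```python
import ray
from ray.util.sgd import TorchTrainer

# Hypothetical module holding the creators and operator sketched above.
from model_components import (model_creator, data_creator, optimizer_creator,
                              loss_creator, MyTrainingOperator)

ray.init()

trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=loss_creator,
    training_operator_cls=MyTrainingOperator,
    num_workers=2,      # placeholder; number of distributed workers
    use_gpu=False,      # placeholder; set to True on GPU nodes
    use_fp16=True,      # bonus: mixed precision training
    use_tqdm=True,      # bonus: progress bars during training
    config={"lr": 1e-3, "batch_size": 32, "n_inputs": 10},
)

for epoch in range(10):
    train_stats = trainer.train()
    val_stats = trainer.validate()
    print(f"Epoch {epoch}: {train_stats} | {val_stats}")

trainer.shutdown()
ray.shutdown()
```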
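As for the configuration file, one possible way to wire it in is to load it with PyYAML at the top of main.py and split it between the TorchTrainer arguments and its config dictionary; the file name config.yml and its keys are just an assumption.

```python
import yaml

# Hypothetical config.yml contents:
#   num_workers: 2
#   use_gpu: false
#   lr: 0.001
#   batch_size: 32
#   n_inputs: 10

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)

# Trainer-level options are popped out and passed to TorchTrainer directly;
# whatever remains goes through its `config` argument to the creators.
num_workers = config.pop("num_workers", 1)
use_gpu = config.pop("use_gpu", False)
```

Keeping these values in a YAML file makes it easy to change worker counts and hyperparameters between local runs and cluster runs without touching the code.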
If we want to use a cluster on a cloud computing platform, there are some extra steps: