Distributed training

Basics

The core steps needed to run distributed training with RaySGD are the following:

  1. Write the code that creates each component (sketches follow this list):

    model_creator

    data_creator

    optimizer_creator

    loss_creator

    TrainingOperator (applies my custom training and validation methods; see the corresponding docs, as well as the API)

    Bonus:

    Implement use_tqdm

    Implement use_fp16 (it seems like it should work even with a custom training method, without further changes; I just need to set it to True)

  2. Create a main script to run the distributed training

    main.py (the full script where I call RaySGD to train and validate the model; a sketch follows this list)

  3. (recommended) Create a configuration YAML file (loaded in the main.py sketch below)
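For reference, here is a minimal sketch of the four creator functions, assuming a toy PyTorch regression setup; the model, data, and config keys (input_dim, batch_size, lr) are placeholders, not part of a real project.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def model_creator(config):
    # Returns the model that RaySGD replicates on each worker.
    return nn.Linear(config.get("input_dim", 10), 1)

def data_creator(config):
    # Returns (train_loader, val_loader); random tensors as stand-in data.
    x = torch.randn(1000, config.get("input_dim", 10))
    y = torch.randn(1000, 1)
    dataset = TensorDataset(x, y)
    train_loader = DataLoader(dataset, batch_size=config.get("batch_size", 32))
    val_loader = DataLoader(dataset, batch_size=config.get("batch_size", 32))
    return train_loader, val_loader

def optimizer_creator(model, config):
    # Receives the model built by model_creator.
    return torch.optim.Adam(model.parameters(), lr=config.get("lr", 1e-3))

def loss_creator(config):
    return nn.MSELoss()
```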
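A custom TrainingOperator plugs in the per-batch logic: subclass ray.util.sgd.torch.TrainingOperator and override train_batch (and, analogously, validate_batch). This is a generic example of the hook, not my actual training method.

```python
from ray.util.sgd.torch import TrainingOperator

class MyTrainingOperator(TrainingOperator):
    # self.model, self.optimizer, and self.criterion are built from the
    # creator functions before train_batch is ever called.
    def train_batch(self, batch, batch_info):
        features, target = batch
        if self.use_gpu:
            features, target = features.cuda(), target.cuda()
        self.optimizer.zero_grad()
        output = self.model(features)
        loss = self.criterion(output, target)
        loss.backward()
        self.optimizer.step()
        # The returned metrics are aggregated across workers by RaySGD.
        return {"train_loss": loss.item(), "num_samples": features.size(0)}
```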
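And a sketch of the main script, assuming the classic creator-based TorchTrainer API (matching the components listed above), that the pieces live in a hypothetical my_components module, and that the YAML file holds flat keys such as num_workers, use_gpu, and num_epochs. Note that use_fp16 and use_tqdm from the bonus items are just constructor flags on TorchTrainer.

```python
# main.py
import ray
import yaml
from ray.util.sgd import TorchTrainer

# Hypothetical module holding the creator functions and the operator.
from my_components import (model_creator, data_creator, optimizer_creator,
                           loss_creator, MyTrainingOperator)

def main():
    # The (recommended) configuration YAML file from step 3.
    with open("config.yaml") as f:
        config = yaml.safe_load(f)

    ray.init()
    trainer = TorchTrainer(
        model_creator=model_creator,
        data_creator=data_creator,
        optimizer_creator=optimizer_creator,
        loss_creator=loss_creator,
        training_operator_cls=MyTrainingOperator,
        num_workers=config.get("num_workers", 2),
        use_gpu=config.get("use_gpu", False),
        use_fp16=config.get("use_fp16", True),  # bonus: mixed precision
        use_tqdm=True,                          # bonus: progress bars
        config=config,
    )
    for epoch in range(config.get("num_epochs", 10)):
        train_stats = trainer.train()  # one pass over the training data
        val_stats = trainer.validate()
        print(f"epoch {epoch}: {train_stats} | {val_stats}")
    trainer.shutdown()

if __name__ == "__main__":
    main()
```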

If we want to use a cluster on a cloud computing platform, there are some extra steps:

  1. Make sure that we have enough quota to set up the cluster that we need (i.e., permission to access enough resources, such as CPUs, GPUs, RAM, and disk storage; each cloud provider has its own quotas and requests pages)
  2. Create an automatic cluster setup YAML file (launched with ray up; see the note after this list)
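Once the cluster YAML exists, ray up <file> launches the cluster and ray down tears it back down. The only change needed in main.py is to attach to the running cluster instead of starting a local Ray instance; the file name below is a placeholder.

```python
import ray

# On the head node of a cluster launched with `ray up cluster.yaml`
# (cluster.yaml is a placeholder name), attach to the running cluster
# rather than starting a fresh local one; the rest of main.py is unchanged.
ray.init(address="auto")
```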