Tips & Tricks

To reach Ray's dashboard, we need to forward its port, either through the ray submit call or through an SSH tunnel (plain ssh or the cloud provider's tooling). In this example, I'm assuming that:
- Ray cluster's head node is in an instance called ray-default-head-071df709
- My project name is perseids-scholarship
- The instance is in zone europe-west2-a
- The dashboard listens on port 8265 (Ray's default)
- I want to access the dashboard on my computer through the URL http://localhost:8000/
- ~~In Azure, I might also need to allow access to port 8265 by adding an inbound rule in the corresponding Network Security Group.~~

Directly in Ray:

ray submit config.yaml script.py --start --stop --port-forward 8265

In ssh in general:

ssh -L 8000:localhost:8265 gw.example.com

In GCP:

gcloud compute ssh ray-default-head-071df709 \
    --project perseids-scholarship \
    --zone europe-west2-a \
    -- -L 8000:localhost:8265

In Azure:

ssh -L 8000:localhost:8265 ubuntu@vm-public-ip
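All of the tunnel commands above share the same local_port:localhost:remote_port pattern. As a minimal sketch, assuming you launch the tunnel from Python, the GCP variant could be assembled from the values listed earlier. build_gcloud_tunnel_cmd is a hypothetical helper name of my own, not part of Ray or gcloud.

```python
import shlex


def build_gcloud_tunnel_cmd(instance, project, zone, local_port=8000, remote_port=8265):
    """Return the argument list for a gcloud SSH tunnel to the Ray dashboard.

    Hypothetical helper: it only assembles the command shown in the notes.
    """
    return [
        "gcloud", "compute", "ssh", instance,
        "--project", project,
        "--zone", zone,
        "--",  # everything after this separator is passed to ssh itself
        "-L", f"{local_port}:localhost:{remote_port}",
    ]


cmd = build_gcloud_tunnel_cmd(
    "ray-default-head-071df709", "perseids-scholarship", "europe-west2-a"
)
# Print the command as a single shell-safe string for inspection.
print(shlex.join(cmd))
```

Passing cmd to subprocess.run(cmd) would open the tunnel; the dashboard should then be reachable at http://localhost:8000/ for as long as that process stays alive.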

Because Ray wraps the model in PyTorch's DistributedDataParallel (DDP), the model's custom attributes are no longer reachable directly on the wrapped object. To read them again, we must go through model.module, assuming that model is the original model wrapped in DDP.

# Before Ray
model.custom_att           # Works

# After Ray
model.custom_att           # Fails
model.module.custom_att    # Works
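If the same code has to run both with and without Ray, a small helper can hide the difference. This is a minimal sketch: unwrap_model is a hypothetical name, and OriginalModel and FakeDDP are stand-ins I made up so the example runs without torch or a process group. The only thing it relies on from real DDP is that the wrapper exposes the original model as .module.

```python
def unwrap_model(model):
    """Return the wrapped model if `model` has a `.module`, else `model` itself."""
    return getattr(model, "module", model)


# Stand-in classes (assumptions for illustration, not PyTorch classes).
class OriginalModel:
    custom_att = 42


class FakeDDP:
    """Mimics only the one DDP behaviour used here: storing the model as .module."""

    def __init__(self, module):
        self.module = module


plain = OriginalModel()
wrapped = FakeDDP(plain)

# Works the same whether or not the model has been wrapped.
print(unwrap_model(plain).custom_att)
print(unwrap_model(wrapped).custom_att)
```

One caveat: getattr falls back silently, so a model that happens to define its own unrelated module attribute would be unwrapped by mistake. An isinstance check against DistributedDataParallel would be stricter.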

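In the RaySGD setting, PyTorch appears to hold about 2x the memory required for a single batch's tensor on the head node's GPU, probably as a precaution. In the error below, 11.08 GiB is already allocated when a further 5.54 GiB is requested, which is exactly twice the requested size: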

RuntimeError: CUDA out of memory. Tried to allocate 5.54 GiB (GPU 0; 15.90 GiB total capacity; 11.08 GiB already allocated; 4.10 GiB free; 11.09 GiB reserved in total by PyTorch)
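The 2x relationship can be checked straight from the numbers in the message. The snippet below is a rough sketch that assumes the message keeps the exact wording shown above (the format may differ across PyTorch versions); parse_oom is a hypothetical helper name.

```python
import re

msg = (
    "RuntimeError: CUDA out of memory. Tried to allocate 5.54 GiB "
    "(GPU 0; 15.90 GiB total capacity; 11.08 GiB already allocated; "
    "4.10 GiB free; 11.09 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Extract the GiB figures from an OOM message in the format shown above."""
    pattern = (
        r"Tried to allocate ([\d.]+) GiB .*?([\d.]+) GiB total capacity; "
        r"([\d.]+) GiB already allocated; ([\d.]+) GiB free"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("message does not match the expected OOM format")
    requested, total, allocated, free = map(float, match.groups())
    return {"requested": requested, "total": total, "allocated": allocated, "free": free}


stats = parse_oom(msg)
# Ratio of memory already held to the size of the failed request.
ratio = stats["allocated"] / stats["requested"]
print(round(ratio, 2))
# The request fails because it exceeds what is still free.
print(stats["requested"] > stats["free"])
```

A ratio close to 2 with a request larger than the free memory fits the pattern described above: two batch-sized blocks are already held when a third is requested.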