RuntimeError: [enforce fail at /pytorch/torch/lib/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: 10.37.0.1 at /pytorch/torch/lib/THD/process_group/General.cpp:17
https://discuss.pytorch.org/t/runtime-error-using-distributed-with-gloo/16579
By default, both the NCCL and Gloo backends try to find the network interface to use for communication. However, in our experience this is not always guaranteed to succeed. Therefore, if either backend fails to find the correct network interface, you can try setting the following environment variables (each one applicable to its respective backend):
NCCL_SOCKET_IFNAME=eth0
GLOO_SOCKET_IFNAME=eth0
https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization
BTW, use ifconfig to find your first Ethernet interface.
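If you would rather set these variables from inside the training script than in the shell, something like the following minimal sketch might work. The helper names `list_interfaces` and `set_socket_ifname` are made up for illustration and are not part of PyTorch, and `eth0` is only a placeholder for whatever interface name your machine actually reports.

```python
import os
import socket


def list_interfaces():
    """Return the interface names the OS reports (similar to what ifconfig lists)."""
    return [name for _, name in socket.if_nameindex()]


def set_socket_ifname(ifname, env=os.environ):
    """Point both backends at one interface (hypothetical helper, not a PyTorch API).

    GLOO_SOCKET_IFNAME is read by the Gloo backend and NCCL_SOCKET_IFNAME by
    the NCCL backend; setting both is harmless if you only use one of them.
    """
    env["GLOO_SOCKET_IFNAME"] = ifname
    env["NCCL_SOCKET_IFNAME"] = ifname
    return ifname
```

Call `set_socket_ifname("eth0")` (with your real interface name, which `list_interfaces()` can help you find) before `torch.distributed.init_process_group(...)`, since the variables need to be in the environment when the process group is created.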