chenwydj
3/3/2020 - 1:04 AM

Runtime error using Distributed with gloo

RuntimeError: [enforce fail at /pytorch/torch/lib/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: 10.37.0.1 at /pytorch/torch/lib/THD/process_group/General.cpp:17

https://discuss.pytorch.org/t/runtime-error-using-distributed-with-gloo/16579

By default, both the NCCL and Gloo backends try to find the network interface to use for communication on their own. In our experience, this detection does not always succeed. If either backend cannot find the correct network interface, you can set it explicitly with an environment variable, one per backend: NCCL_SOCKET_IFNAME for NCCL and GLOO_SOCKET_IFNAME for Gloo (for example, export GLOO_SOCKET_IFNAME=eth0).
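
If you would rather set the interface from inside the training script than in the shell, something like the following should work. This is only a sketch: `set_socket_ifname` and `init_gloo` are hypothetical helper names, "eth0" is a placeholder interface, and it assumes MASTER_ADDR and MASTER_PORT are already exported for `env://` initialisation. The variables have to be set before the process group is created.

```python
import os


def set_socket_ifname(ifname: str) -> None:
    """Hypothetical helper: point both backends at one network interface."""
    # Each variable is read only by its own backend, so setting both is harmless.
    os.environ["GLOO_SOCKET_IFNAME"] = ifname
    os.environ["NCCL_SOCKET_IFNAME"] = ifname


def init_gloo(ifname: str, rank: int, world_size: int) -> None:
    """Hypothetical helper: set the interface, then create the Gloo process group."""
    # Imported here so that set_socket_ifname can be used without torch installed.
    import torch.distributed as dist

    set_socket_ifname(ifname)
    # Assumption: MASTER_ADDR and MASTER_PORT are already set in the environment.
    dist.init_process_group(
        backend="gloo",
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
```

On each node you would then call, for example, `init_gloo("eth0", rank, world_size)` with that node's own rank, replacing "eth0" with the interface that actually carries the address the other nodes connect to.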

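BTW, run ifconfig (or ip addr on newer systems) to find the name of your first Ethernet interface, e.g. eth0, and use that name as the value of the environment variable.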
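
If ifconfig is not installed, the interface names can also be listed from Python. A small sketch, where `candidate_interfaces` is a hypothetical helper name and "lo" is assumed to be the loopback interface (the usual name on Linux):

```python
import socket


def candidate_interfaces() -> list:
    """Hypothetical helper: return interface names, skipping the loopback 'lo'."""
    # socket.if_nameindex() returns (index, name) pairs for every interface.
    return [name for _, name in socket.if_nameindex() if name != "lo"]


if __name__ == "__main__":
    for name in candidate_interfaces():
        print(name)
```

This only prints names. You still have to pick the one that is on the network shared by all nodes.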