Last Updated on 2023-06-07 by Clay
Problem
Today, while training a model on my server, I wrote a GPU parallel training script and fed in the newest training data; but during training, I got an error message about "GPU not found". After using torch.cuda.is_available()
to check my device, I got another warning as follows:
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
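For reference, the check itself is only a couple of lines; the device_count() call below is an extra illustration I am adding here, not part of my original script:

import torch

# On the broken machine, this printed False and triggered the warning above.
print(torch.cuda.is_available())

# The warning says the available devices were set to zero, so this should report 0.
print(torch.cuda.device_count())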
According to the error message, this is an unknown CUDA error, possibly caused by a wrong environment configuration.
It's very weird! Of course I had installed the GPU driver and configured CUDA; I had even trained many AI models on this machine before. It just didn't make sense.
Solution
Following the discussions on the forum (links attached in the References), we have two choices:
- Restart your server (though some people say this did not help in their situation)
- Or run the following commands:
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
These commands unload and then reload the nvidia_uvm kernel module, effectively resetting the CUDA driver state. I used this approach and successfully got a True response from torch.cuda.is_available().
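As a quick sanity check after reloading the module, you can run the same check again (a minimal sketch, assuming the same PyTorch environment as before; the device-name print is just illustrative):

import torch

# After reloading nvidia_uvm, CUDA initialization should succeed again.
print(torch.cuda.is_available())   # now returns True
print(torch.cuda.device_count())   # number of visible GPUs

if torch.cuda.is_available():
    # e.g. prints the model name of the first GPU
    print(torch.cuda.get_device_name(0))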
References
- https://discuss.pytorch.org/t/userwarning-cuda-initialization-cuda-unknown-error-this-may-be-due-to-an-incorrectly-set-up-environment-e-g-changing-env-variable-cuda-visible-devices-after-program-start-setting-the-available-devices-to-be-zero/129335/2
- https://stackoverflow.com/questions/66857471/cuda-initialization-cuda-unknown-error-this-may-be-due-to-an-incorrectly-set