[Solved] UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.) return torch._C._cuda_getDeviceCount() > 0

Last Updated on 2023-06-07 by Clay

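Problem Description

Today, while training a model on my server, I wrote a script for multi-GPU parallel training and then fed the latest data to the model to start training. During training, however, I got an error saying that no GPU was available. When I checked with torch.cuda.is_available(), I got the following message.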

UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0

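According to the message, this is an unknown CUDA error, possibly caused by an incorrectly configured environment.
That was very strange! The CUDA environment and GPU driver on this machine had been set up long ago, and I had already trained and run countless AI models on it. The error made no sense at all.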
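The check described above can be reproduced in a way that also records the warning text. The following is a minimal sketch, not the script from this post: `probe_cuda` is a hypothetical helper name, and it accepts any callable so it can also be tried on a machine without PyTorch installed.

```python
import warnings

try:
    import torch  # optional here; the helper below does not depend on it
except ImportError:
    torch = None


def probe_cuda(is_available):
    """Call is_available() and return (result, warning messages raised meanwhile).

    `probe_cuda` is a hypothetical helper for illustration only.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning instead of letting the default filter hide repeats.
        warnings.simplefilter("always")
        ok = bool(is_available())
    return ok, [str(w.message) for w in caught]


if __name__ == "__main__":
    if torch is None:
        print("PyTorch is not installed; nothing to probe.")
    else:
        ok, messages = probe_cuda(torch.cuda.is_available)
        print("CUDA available:", ok)
        for message in messages:
            print("warning:", message)
```

If the environment is in the broken state described here, the expectation is that the availability flag is False and the CUDA initialization warning shows up in the recorded messages.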

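Solution

According to the discussion on the forum (linked in the References at the end of this post), we can choose to:

1. Reboot (though some people reported that this had no effect on their machines)
2. Run the following commands: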

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

This unloads the nvidia_uvm kernel module and then loads it again.

I tried method 2 first, and torch.cuda.is_available() then returned True.
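To confirm that nvidia_uvm is present again after the reload, one option on Linux is to look it up in /proc/modules, which lists the loaded kernel modules with the module name as the first field of each line. This is a rough sketch under that assumption; `loaded_modules` and `is_module_loaded` are hypothetical helper names, not part of any library.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated field is the module name.
            names.add(fields[0])
    return names


def is_module_loaded(name, proc_modules_text):
    """Check whether a module name appears in the given module listing."""
    return name in loaded_modules(proc_modules_text)


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        loaded = is_module_loaded("nvidia_uvm", path.read_text())
        print("nvidia_uvm loaded:", loaded)
    else:
        print("/proc/modules not found; this check only applies to Linux.")
```

Running it before and after the two commands above should show whether the module was actually reloaded; it says nothing about whether CUDA itself initializes, so torch.cuda.is_available() remains the final check.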

References
