[Solved] RuntimeError: CUDA error: device kernel image is invalid - CUDA kernel errors might be asynchronously reported at some other API call...

Last Updated on 2022-07-27 by Clay

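Problem Description

One of my recent tasks was to refactor an older project's training pipeline with PyTorch Lightning and confirm that the scores did not change much. After refactoring one binary classification project, a trial run failed with the following error: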

RuntimeError: CUDA error: device kernel image is invalid
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Segmentation fault (core dumped)

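Solutions

Method 1: Run the task on the CPU

The error message does not actually say where the problem is. To pin it down, the first thing to try is running the same code on the CPU and checking whether it still fails. Errors on the CPU are raised synchronously, so the stack trace points at the call that really failed.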
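As a rough illustration of the CPU check, here is a minimal sketch in plain PyTorch. The tiny model, the tensor shapes, and the random data are hypothetical stand-ins, not the original project's code; with PyTorch Lightning the equivalent is usually to pass accelerator="cpu" to the Trainer.

```python
import torch
from torch import nn

# Hypothetical stand-in for the real binary classifier (sigmoid output + BCELoss).
model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()

# Force the CPU even if a GPU is available, so errors surface synchronously.
device = torch.device("cpu")
model.to(device)

# Random placeholder batch: 4 samples, 8 features, binary targets.
inputs = torch.randn(4, 8, device=device)
targets = torch.randint(0, 2, (4, 1), device=device).float()

# One forward/backward pass; any real bug should raise here with a usable traceback.
loss = loss_fn(model(inputs), targets)
loss.backward()
print(f"CPU loss: {loss.item():.4f}")
```

If this pass succeeds on the CPU but the GPU run still fails, the problem is more likely tied to the CUDA side (for example mixed precision) than to the model logic.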

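Method 2: Run the program with CUDA_LAUNCH_BLOCKING=1

The error message itself suggests setting CUDA_LAUNCH_BLOCKING=1 when debugging. This is an environment variable rather than a program argument. With it set, CUDA kernel launches run synchronously, so the error is reported at the call that actually caused it.

Set it when launching the program, with a command along these lines: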

CUDA_LAUNCH_BLOCKING=1 python
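If setting the variable on the command line is inconvenient (for example when the job is started from another Python process), the same thing can be sketched with the standard library. The helper name run_with_blocking_launches is hypothetical, and the one-line child program only echoes the variable to show that it arrives; a real training script would go in its place.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(script_args):
    """Run `python <script_args...>` with CUDA_LAUNCH_BLOCKING=1 in its environment."""
    # Copy the current environment and add the variable for the child process only.
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(
        [sys.executable, *script_args],
        env=env,
        capture_output=True,
        text=True,
    )


# Placeholder child program: prints the variable it received.
result = run_with_blocking_launches(
    ["-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
)
print(result.stdout.strip())
```

The variable has to be in place before CUDA is initialized, which is why this sketch sets it on the child process rather than in the middle of a running program.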

Incidentally, when I ran my task with this variable set, the error changed to:

RuntimeError: torch.nn.functional.binary_cross_entropy and torch.nn.BCELoss are unsafe to autocast. Many models use a sigmoid layer right before the binary cross entropy layer.
In this case, combine the two layers using torch.nn.functional.binary_cross_entropy_with_logits or torch.nn.BCEWithLogitsLoss.  binary_cross_entropy_with_logits and BCEWithLogits are safe to autocast.

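In short, my model passed its output through a sigmoid layer and then into a binary cross entropy function to compute the loss, and PyTorch considers that combination unsafe under autocast. It recommends using BCEWithLogitsLoss() or binary_cross_entropy_with_logits() instead, both of which take raw logits.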
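The change can be sketched as follows. This is a minimal comparison on random placeholder data, not the original project's code; it only shows that moving the sigmoid into the loss gives the same value in full precision.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Placeholder raw model outputs (logits, no sigmoid applied) and binary targets.
logits = torch.randn(4, 1)
targets = torch.randint(0, 2, (4, 1)).float()

# Original pattern: explicit sigmoid followed by BCELoss (the one autocast rejects).
old_loss = nn.BCELoss()(torch.sigmoid(logits), targets)

# Recommended pattern: pass the logits straight to BCEWithLogitsLoss.
new_loss = nn.BCEWithLogitsLoss()(logits, targets)

# Both formulations agree up to floating-point tolerance.
print(torch.allclose(old_loss, new_loss, atol=1e-5))
```

With this pattern the model's final sigmoid layer is removed for training, and torch.sigmoid is applied to the logits only when probabilities are needed at inference time.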

After I made that change, my task ran successfully.
