[Solved] RuntimeError: CUDA error: device-side assert triggered

Today, while using PyTorch to train a simple classifier, I got the following error message:

RuntimeError: CUDA error: device-side assert triggered

I had run into this error before and solved it, but I no longer remembered how.

That is the downside of not taking notes.

The error persisted after I restarted my remote server, so I think we can rule out a hardware problem.

Below, I record the possible solutions I tried.

Troubleshooting

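First, I looked through discussions in GitHub issues. Some people recommend running the code on the CPU to see whether the problem still occurs. Because CUDA kernels run asynchronously, a failure on the GPU is often reported only as a generic device-side assert, while the CPU usually raises a more specific error.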

(The error still occurred on the CPU.)
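A minimal sketch of the CPU check, assuming the model's loss is `nn.CrossEntropyLoss` with class-index targets. The function name `reproduce_on_cpu` and the sizes are hypothetical, chosen to mirror the mismatch described later in this post. It runs one loss computation on CPU tensors and returns the error text instead of a device-side assert.

```python
from typing import List, Optional

import torch
import torch.nn as nn


def reproduce_on_cpu(num_outputs: int, labels: List[int]) -> Optional[str]:
    """Run one loss computation on the CPU and return the error message, if any."""
    logits = torch.randn(len(labels), num_outputs)  # tensors live on the CPU by default
    targets = torch.tensor(labels)                  # int64 class indices
    try:
        nn.CrossEntropyLoss()(logits, targets)
    except (IndexError, RuntimeError) as err:
        # On the CPU, an invalid target raises a readable Python exception.
        return str(err)
    return None


# Hypothetical mismatch: 3139 outputs, but one label equals 3140.
print(reproduce_on_cpu(3139, [1, 2, 3140]))  # prints an out-of-bounds message
print(reproduce_on_cpu(3139, [0, 1, 3138]))  # prints None
```

The exact wording of the message depends on the PyTorch version, but it names the offending target, which the GPU assert does not.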

The next suggestion I saw was to check “whether -1 exists in the labels of the training data”.

I labelled my data myself, so I assumed this problem was impossible. Still, I went back and checked my labels.
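Checking for -1 is one case of a more general rule: every label must be a valid index into the output layer. A small sketch of that check, assuming the labels are integer class indices and the loss is `nn.CrossEntropyLoss` with its default `ignore_index`. The helper `find_bad_labels` is hypothetical and uses only plain Python, so it can run on a label list straight from an annotation file.

```python
from typing import Iterable, List


def find_bad_labels(labels: Iterable[int], num_outputs: int) -> List[int]:
    """Return the distinct labels outside [0, num_outputs - 1], sorted."""
    return sorted({y for y in labels if y < 0 or y >= num_outputs})


# Hypothetical label list containing a -1 and a label equal to the number of outputs.
print(find_bad_labels([0, 5, -1, 3139], num_outputs=3139))  # [-1, 3139]
```

An empty result means every label fits the output layer.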

Then I found the problem: my labels ran from 1 to 3140, but the final layer had only 3139 output neurons.

After I increased the last layer to 3140 neurons, the problem was solved!
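One detail worth keeping in mind: `nn.CrossEntropyLoss` expects class indices from 0 to C - 1, where C is the number of outputs. If the labels really are 1-based (1 to 3140), a 3140-neuron layer still leaves label 3140 out of range, so either shift the labels down by one, as in the sketch below, or give the layer 3141 outputs. The `in_features` value of 512 and the random features are hypothetical placeholders for your own model.

```python
import torch
import torch.nn as nn

num_classes = 3140
raw_labels = torch.tensor([1, 2, 3139, 3140])  # hypothetical 1-based labels

# Shift 1-based labels to the 0-based range [0, num_classes - 1] the loss expects.
labels = raw_labels - 1

# Hypothetical classifier head: one output neuron per class.
head = nn.Linear(512, num_classes)
features = torch.randn(labels.shape[0], 512)

loss = nn.CrossEntropyLoss()(head(features), labels)
print(loss.item())
```

Deriving `num_classes` from the dataset rather than hard-coding it in the model keeps the two from drifting apart.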