Today, while using PyTorch to train a simple classifier, I got the following error message:
RuntimeError: CUDA error: device-side assert triggered
I have run into this error before and solved it, but I could not remember how.
This is the disadvantage of not taking notes.
The error was still there after I restarted my remote server, so I think hardware problems can be ruled out.
I will record the possible solutions below.
Troubleshooting
Step 1: Test on the CPU
First, I looked through the discussions in GitHub issues. Some people recommend running the code on the CPU to check whether the same problem still occurs.
(In my case, the error still occurred.)
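The reason the CPU test is worth trying is that CUDA errors are reported asynchronously, so the stack trace often points at the wrong line, while the same mistake on the CPU usually fails right at the offending call with a more readable message. Here is a minimal sketch of that idea. The model, the sizes, and the labels are made up for illustration and are not my real training code; the out-of-range label is planted on purpose.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
num_classes = 4
batch_size = 3

model = nn.Linear(8, num_classes)        # stand-in for the real classifier
inputs = torch.randn(batch_size, 8)
labels = torch.tensor([0, 2, 4])         # 4 is out of range for 4 classes (valid: 0..3)

criterion = nn.CrossEntropyLoss()

error_message = None
try:
    # Everything stays on the CPU: there are no .cuda() or .to("cuda") calls.
    loss = criterion(model(inputs), labels)
except (IndexError, RuntimeError) as exc:
    # The exact exception type and wording depend on the PyTorch version.
    error_message = str(exc)

print(error_message)
```

If moving the whole training run to the CPU is not practical, setting the environment variable CUDA_LAUNCH_BLOCKING=1 before starting the script is another commonly suggested way to get a stack trace that points at the failing operation.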
Step 2: Check your labels
The next suggestion I saw was to check whether -1 appears in the labels of the training data.
I labelled my data myself, so I did not think this could be the problem, but I went and checked my labels anyway.
Then I found the problem: my labels ranged from 1 to 3140, but the final layer had only 3139 output neurons.
After I increased the number of neurons in the last layer to 3140, the problem was solved!
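A check like this is easy to automate before training starts. The sketch below assumes a loss such as nn.CrossEntropyLoss, which expects class indices from 0 to C-1 for a layer with C outputs, so a label of -1 and a label equal to C are both invalid. The helper name find_invalid_labels and the toy labels are hypothetical. It works on a plain list of integers, so a label tensor would first be converted with .tolist().

```python
def find_invalid_labels(labels, num_classes):
    """Return the sorted distinct labels outside the range 0..num_classes-1."""
    return sorted({y for y in labels if y < 0 or y >= num_classes})

# Toy labels mimicking the situation above on a small scale:
# labels run from 1 to 5, but the output layer has only 4 units.
labels = [1, 2, 3, 4, 5, 5, 2]
num_outputs = 4

print(find_invalid_labels(labels, num_outputs))   # reports the out-of-range labels

# One possible fix: size the layer to the number of distinct classes
# and shift 1-based labels so that they start at 0.
num_classes = len(set(labels))
shifted = [y - 1 for y in labels]
print(find_invalid_labels(shifted, num_classes))  # empty when every label is valid
```

Note that if the labels really start at 1, the largest label equals the number of classes, so either the labels need to be shifted down by one as above or the output layer needs one extra unit.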