Today I want to record a common problem and its solution. Simply put, the error message looks like this:
RuntimeError: CUDA out of memory. Tried to allocate 2.0 GiB.
This error is actually very simple to understand: the GPU does not have enough memory, so the data we want to train on cannot all be stored on the GPU, and the program stops unexpectedly.
On Linux, the memory capacity shown by the `nvidia-smi` command is the GPU memory, while the memory shown by the `htop` command is the computer's ordinary RAM used to run programs; the two are different.
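As a side note, if you want to check GPU memory from inside PyTorch rather than from the shell, a minimal sketch using the standard `torch.cuda` utilities looks like this:

```python
import torch

if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory
    allocated = torch.cuda.memory_allocated(0)   # memory currently occupied by tensors
    reserved = torch.cuda.memory_reserved(0)     # memory held by PyTorch's caching allocator
    print(f"Total:     {total / 1024**3:.2f} GiB")
    print(f"Allocated: {allocated / 1024**3:.2f} GiB")
    print(f"Reserved:  {reserved / 1024**3:.2f} GiB")
```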
Solution
If you encounter this problem during training, it is usually because the batch size is too large. Just imagine: if you feed a huge amount of data to the GPU at once, isn't the memory likely to overflow?
Conversely, if less data is loaded at a time and the memory is freed after each batch before the next one comes in, GPU memory overflow can be avoided.
So during the training phase, reducing the batch size is a method worth considering, as in the sketch below.
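The following is only a minimal sketch of that idea: the dataset, the `nn.Linear` model, and the batch size of 64 are placeholder assumptions. The point is simply that `batch_size` in `DataLoader` is the knob to turn down when memory runs out.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical dataset and model, just for illustration
dataset = TensorDataset(torch.randn(10000, 300), torch.randint(0, 2, (10000,)))
model = nn.Linear(300, 2).to(device)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# If batch_size=512 triggers "CUDA out of memory", try halving it (256, 128, 64, ...)
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

for inputs, labels in train_loader:
    # Only one batch lives on the GPU at a time
    inputs, labels = inputs.to(device), labels.to(device)

    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
```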
But if you run into this problem during testing, it may be because the model is still building the computation graph needed for gradients.
In PyTorch, we need to switch the model to `eval()` mode and put the testing code inside a `with torch.no_grad()` block.
In this way, the model does not track gradients during inference, as in the sketch below.
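Here is a minimal sketch of that pattern; the `nn.Linear` model and the random test data are placeholder assumptions, and the relevant parts are `model.eval()` and the `with torch.no_grad()` block:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical model and test data, just for illustration
model = nn.Linear(300, 2).to(device)
test_inputs = torch.randn(1000, 300)

model.eval()  # switch layers such as Dropout / BatchNorm to inference behavior

with torch.no_grad():  # no computation graph is built, which saves GPU memory
    for batch in test_inputs.split(64):  # also process test data in smaller chunks
        outputs = model(batch.to(device))
        preds = outputs.argmax(dim=1)
```

Note that `eval()` and `torch.no_grad()` do different things: `eval()` changes layer behavior, while `torch.no_grad()` is what actually stops PyTorch from storing intermediate values for backpropagation.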
Read More
- [Solved][PyTorch] return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: index out of range: Tried to access index 5 out of table with 4 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237
- [Solved][PyTorch] IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
- [Solved][PyTorch] TypeError: not a sequence
- [Solved][PyTorch] ValueError: expected sequence of length 300 at dim 1 (got 3)