When using Python for data analysis, and especially for natural language processing tasks, it is hard to avoid dealing with large files. If you always use the `open()` function and read these big files in one go, you may get memory error messages. This is because calling `read()` on the file object loads the whole file into memory at once.
So, how can we change our approach? The most common methods are:

- Use `with` to open the file and iterate over it
- Use `read([size])` to control how much of the file is read at a time
In this way, we can avoid memory errors!
Oh, another way is to buy more memory.
Reproducing the error

First of all, let me record the situation in which the error may be reported:
```python
text = open('data.txt', 'r', encoding='utf-8').read().split('\n')
for line in text:
    print(line)
```
When dealing with small files, I personally find this approach quite convenient: after all, `text` arrives already split into a list of lines. However, with a large file this loads all of the data into memory at once, putting a fairly heavy burden on memory.
Use `with` to open the file
```python
with open('text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line)
```
We can see that when reading large files, this way of opening them behaves much better and does not report an error: the file object is iterated lazily, one line at a time, so only a single line needs to be held in memory.
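As a minimal sketch of how this can be used in practice (the file name `text.txt` is kept from above, while the line-and-word counting is just an assumed example task), each line is processed as it is read:

```python
# Count lines and words without loading the whole file into memory.
line_count = 0
word_count = 0

with open('text.txt', 'r', encoding='utf-8') as f:
    for line in f:                       # the file object yields one line at a time
        line_count += 1
        word_count += len(line.split())  # split() also discards the trailing '\n'

print(f'{line_count} lines, {word_count} words')
```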
Use `read([size])` to open the file
So, since using `with` to open the file works so well, why use `read([size])` at all?
This is because our text may not contain any line breaks at all. In that case, iterating over it line by line with `with` is no different from reading everything in one go, since the entire file counts as a single line. Because of this possibility, it is necessary to use `read([size])` to control how much is read at a time when needed.
```python
with open('text.txt', 'r', encoding='utf-8') as f:
    for chunk in iter(lambda: f.read(1024), ''):
        print(chunk)
```
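If this pattern comes up often, the chunked reading can be wrapped in a small generator. This is only a sketch under my own assumptions: the helper name `read_in_chunks` and the 1024-character chunk size are arbitrary choices, not anything prescribed by Python:

```python
def read_in_chunks(file_obj, chunk_size=1024):
    """Yield successive chunks of at most chunk_size characters until EOF."""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:   # read() returns an empty string at end of file
            break
        yield chunk

with open('text.txt', 'r', encoding='utf-8') as f:
    for chunk in read_in_chunks(f):
        print(chunk)
```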