
[Python] Use the "chardet" package to determine the encoding of a file

When we read a file in Python, or open it in an editor, using the wrong encoding makes the text appear garbled.

Naturally we want to avoid this, so we need a way to find out the file's encoding.

This article records how to use the chardet Python package to detect the encoding of a file. The detection is not guaranteed to be accurate, but the result is still a useful reference.


chardet

If chardet is not installed in your environment, you can install it with the following command:

sudo pip3 install chardet


Suppose we have a file named test_01.txt. The following code analyzes its encoding:

import chardet

# Read the raw bytes; chardet.detect() expects bytes, not str
with open('test_01.txt', 'rb') as f:
    raw = f.read()

print(chardet.detect(raw))



Output:

{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

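Note that the file must be opened in "rb" (binary) mode. By default, Python opens files in text mode and decodes the bytes with a default encoding that depends on the platform and settings, so opening a file whose encoding we do not know will often raise a UnicodeDecodeError. Reading the raw bytes avoids this, and chardet.detect() needs bytes as its input anyway.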
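Once an encoding has been detected, the natural next step is to decode the bytes with it. The following is only a minimal sketch: the helper name decode_with_guess and the "utf-8" fallback are assumptions made for illustration, not part of chardet. It takes the raw bytes and a dictionary shaped like the output of chardet.detect().

```python
def decode_with_guess(raw: bytes, guess: dict, fallback: str = "utf-8") -> str:
    """Decode raw bytes using a result dict shaped like chardet.detect() output.

    `guess` is expected to look like:
    {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
    `fallback` is an illustrative assumption, used when 'encoding' is None.
    """
    encoding = guess.get("encoding") or fallback
    # errors="replace" substitutes U+FFFD for bytes that do not fit the
    # guessed encoding instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")
```

With chardet installed, it could be called as decode_with_guess(raw, chardet.detect(raw)), where raw is the bytes read from the file in "rb" mode. Keep in mind that the detected name is only a guess, so the decoded text should still be checked by eye.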


After a successful example, let's look at a failed one. This file is named test_02.txt, and I still don't know its encoding.

import chardet

# Read the raw bytes; chardet.detect() expects bytes, not str
with open('test_02.txt', 'rb') as f:
    raw = f.read()

print(chardet.detect(raw))



Output:

{'encoding': None, 'confidence': 0.0, 'language': None}

This is what I meant by saying the method does not work in every situation. Here chardet reports a confidence of 0.0 and returns None for both the encoding and the language.
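When detection fails like this, one possible workaround is to try a short list of likely encodings by hand. The sketch below is an assumption-laden illustration: the function name pick_encoding, the candidate list and the 0.5 confidence threshold are all hypothetical choices, and none of them come from chardet itself.

```python
def pick_encoding(raw, guess, candidates=("utf-8", "gb18030", "big5"),
                  min_confidence=0.5):
    """Return an encoding name for `raw`, or None if nothing fits.

    `guess` is a dict shaped like chardet.detect() output. The candidate
    list and the threshold are illustrative assumptions.
    """
    detected = guess.get("encoding")
    # Trust the detector only when it names an encoding with enough confidence.
    if detected and (guess.get("confidence") or 0.0) >= min_confidence:
        return detected
    # Otherwise try each candidate and keep the first that decodes cleanly.
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

A candidate that decodes without error is not necessarily the right one, since many byte sequences are valid in several encodings. The result is a hint to check manually, just like the output of chardet.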

