Last updated: Apr 8, 2024
The Python "UnicodeDecodeError: 'charmap' codec can't decode byte in position" occurs when we specify an incorrect encoding or don't explicitly set the encoding keyword argument when opening a file.
To solve the error, specify the correct encoding, e.g. utf-8.
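When the encoding argument is omitted, open() falls back to a platform-dependent default, which is why the same script can work on one machine and fail on another. The sketch below prints that default; the variable name default_encoding is only for illustration, and the result assumes Python's UTF-8 mode is not enabled.

```python
import locale

# open() uses this value when no encoding argument is passed
# (assuming UTF-8 mode is not enabled).
default_encoding = locale.getpreferredencoding(False)

print(default_encoding)
```

On many Windows installations this prints a legacy code page rather than utf-8, which is a common source of the 'charmap' error.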
Here is an example of how the error occurs.
I have a file called example.txt with the following contents.
𝘈Ḇ𝖢𝕯٤ḞԍНǏ
hello world
And here is the code that tries to decode the contents of example.txt.
# ⛔️ UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1: character maps to <undefined>
with open('example.txt', 'r', encoding='cp856') as f:
    lines = f.readlines()
    print(lines)
The error is caused because the example.txt file doesn't use the specified encoding (cp856).
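To see exactly which byte triggers the failure, the exception object can be inspected. This is a minimal sketch that decodes the same text in memory instead of reading example.txt; the names data, failing_position and failing_byte are illustrative.

```python
# The same characters as in example.txt, encoded as UTF-8 bytes.
data = '𝘈Ḇ𝖢𝕯٤ḞԍНǏ'.encode('utf-8')

try:
    data.decode('cp856')
except UnicodeDecodeError as err:
    # err.start is the index of the first byte that could not be decoded.
    failing_position = err.start
    failing_byte = err.object[err.start]
    print(err.encoding, failing_position, hex(failing_byte), err.reason)
```

The position and byte reported here match the ones in the error message above.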
If you know the encoding the file uses, make sure to specify it using the encoding keyword argument.
Otherwise, the first thing you can try is setting the encoding to utf-8.
with open('example.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    # ✅ ['𝘈Ḇ𝖢𝕯٤ḞԍНǏ\n', 'hello world']
    print(lines)
The utf-8 encoding is capable of encoding over a million valid character code points in Unicode.
The same approach can be used if you use the open() function directly instead of using the with statement.
my_file = open('example.txt', 'r', encoding='utf-8')
lines = my_file.readlines()
print(lines)  # ['𝘈Ḇ𝖢𝕯٤ḞԍНǏ\n', 'hello world']
# Without the with statement, the file has to be closed manually.
my_file.close()
You can view all of the standard encodings in this table of the official docs.
Some of the common encodings are ascii, latin-1 and utf-32.
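Before passing an encoding name to open(), you can check that Python recognizes it. A small sketch using codecs.lookup from the standard library; the canonical_names dictionary is just an illustrative container.

```python
import codecs

canonical_names = {}

for name in ('ascii', 'latin-1', 'utf-8', 'utf-32'):
    # codecs.lookup raises LookupError for an unknown encoding name.
    canonical_names[name] = codecs.lookup(name).name

print(canonical_names)
```

Aliases resolve to one canonical codec name, so different spellings of the same encoding behave identically.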
The pathlib module
If you use the pathlib module, specify the encoding when calling the method that reads or writes the file.
from pathlib import Path
text = Path('example.txt').read_text(encoding='utf-8')
# 𝘈Ḇ𝖢𝕯٤ḞԍНǏ
# hello world
print(text)
You can pass the encoding when calling methods such as Path.read_text or Path.write_text.
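As a rough illustration of passing the same encoding to both methods, the sketch below writes and reads a file in a temporary directory. The file contents and the names tmp_dir, path and raw_size are assumptions made for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'example.txt'

    # Write and read with the same explicit encoding.
    path.write_text('Ḇ hello world', encoding='utf-8')
    text = path.read_text(encoding='utf-8')

    # The non-ASCII character takes more than one byte on disk.
    raw_size = len(path.read_bytes())

print(text, raw_size)
```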
If the error persists, you could set the errors keyword argument to ignore to ignore the characters that cannot be decoded.
Note that ignoring characters that cannot be decoded can lead to data loss.
# 👇️ Set errors to ignore
with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()
    # ✅ ['𝘈Ḇ𝖢𝕯٤ḞԍНǏ\n', 'hello world']
    print(lines)
Opening the file with an incorrect encoding with errors set to ignore won't raise a UnicodeDecodeError, but the decoded text is likely to be garbled.
with open('example.txt', 'r', encoding='cp856', errors='ignore') as f:
    lines = f.readlines()
    # ✅ ['\xadרט©ז\xadצ\xadץ»┘©×םן\n', 'hello world']
    print(lines)
The characters that cannot be decoded are simply ignored.
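To make the data loss visible, it can help to compare errors='ignore' with errors='replace', which substitutes a placeholder character instead of silently dropping bytes. A minimal in-memory sketch; the sample string and variable names are illustrative.

```python
data = 'café'.encode('utf-8')  # the é is stored as two bytes

# 'ignore' drops the bytes that cannot be decoded.
ignored = data.decode('ascii', errors='ignore')

# 'replace' inserts U+FFFD for every undecodable byte.
replaced = data.decode('ascii', errors='replace')

print(repr(ignored))
print(repr(replaced))
```

With replace, the positions of the damaged characters stay visible, which makes it easier to spot that the wrong encoding was used.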
If you don't need to interact with the contents of the file, you can open it in binary mode without decoding it.
with open('example.txt', 'rb') as f:
    lines = f.readlines()
    # ✅ [b'\xf0\x9d\x98\x88\xe1\xb8\x86\xf0\x9d\x96\xa2\xf0\x9d\x95\xaf\xd9\xa4\xe1\xb8\x9e\xd4\x8d\xd0\x9d\xc7\x8f\n', b'hello world']
    print(lines)
We opened the file in binary mode (using the rb - read binary mode), so the lines list contains bytes objects.
You can use this approach if you need to upload the file to a remote server and don't need to decode it.
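As one hedged example of working with a file without decoding it, the sketch below reads the raw bytes and computes a checksum, the kind of step that might precede an upload. The temporary file, its contents and the names payload and checksum are assumptions for illustration.

```python
import hashlib
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')

    # Create a sample file to read back.
    with open(path, 'wb') as f:
        f.write('Ḇ hello world'.encode('utf-8'))

    # Binary mode returns bytes, so no codec is involved
    # and no UnicodeDecodeError can be raised.
    with open(path, 'rb') as f:
        payload = f.read()

checksum = hashlib.sha256(payload).hexdigest()
print(len(payload), checksum)
```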
Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.
When decoding a bytes object, we have to use the same encoding that was used to encode the string to a bytes object.
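A mismatch between the two steps does not always raise an error; sometimes it silently produces the wrong characters. A short sketch with an illustrative one-character string:

```python
original = 'é'
encoded = original.encode('utf-8')  # two bytes: 0xc3 0xa9

# Decoding with the encoding that was used to encode restores the text.
right = encoded.decode('utf-8')

# Decoding with a different encoding succeeds here but yields the wrong text.
wrong = encoded.decode('latin-1')

print(repr(right), repr(wrong))
```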
The cp437 encoding
If the error persists, try using the cp437 encoding when opening the file.
with open('example.txt', 'r', encoding='cp437') as f:
    lines = f.readlines()
    # ✅ ['≡¥ÿêß╕å≡¥ûó≡¥ò»┘ñß╕₧╘ì╨¥╟Å\n', 'hello world']
    print(lines)
The Code page 437 encoding is the character set of the original IBM personal computer and includes all printable ASCII characters as well as some accented letters.
If you still get an error, set the errors keyword argument to ignore in the call to the open() function.
with open('example.txt', 'r', encoding='cp437', errors='ignore') as f:
    lines = f.readlines()
    # ✅ ['≡¥ÿêß╕å≡¥ûó≡¥ò»┘ñß╕₧╘ì╨¥╟Å\n', 'hello world']
    print(lines)
The characters that cannot be decoded are simply ignored, which may cause data loss.
If the error persists, try other encodings such as utf-16, utf-32, latin-1, etc.
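Trying encodings one by one can be automated. The helper below is a hypothetical sketch, not a standard library function: read_with_fallback and its default candidate list are assumptions, and the sample file is created only for the demonstration. Note that latin-1 accepts any byte sequence, so it should come last.

```python
import os
import tempfile


def read_with_fallback(path, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    if not encodings:
        raise ValueError('at least one encoding is required')

    last_error = None
    for encoding in encodings:
        try:
            with open(path, 'r', encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next one

    raise last_error


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'example.txt')

    # Sample file saved with a legacy Windows code page.
    with open(sample_path, 'wb') as f:
        f.write('café'.encode('cp1252'))

    text, used_encoding = read_with_fallback(sample_path)

print(text, used_encoding)
```

A clean decode only shows that the bytes are valid in that encoding, not that the encoding is the right one, so the result should still be checked for legibility.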
You can try to figure out what the encoding of the file is by using the file command.
The command is available on macOS and Linux, but can also be used on Windows if you have Git and Git Bash installed.
Make sure to run the command in Git Bash if on Windows.
Open your shell in the directory that contains the file and run the following command.
file *
If the output shows that the file uses the ASCII encoding, this is the encoding you should specify when opening the file.
with open('example.txt', 'r', encoding='ascii') as f:
    lines = f.readlines()
    print(lines)
If you are on Windows, you can also:
If the encoding reported for the file is UTF-8, that's what we have to specify when calling the open() function.
with open('example.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    print(lines)
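From inside Python, one partial check is to look for a byte order mark (BOM) at the start of the file. This is only a sketch under the assumption that the file begins with a BOM, which many files do not; sniff_bom and BOM_TABLE are hypothetical names, while the BOM constants come from the codecs module.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
BOM_TABLE = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(raw):
    """Return an encoding name if raw starts with a known BOM, else None."""
    for bom, name in BOM_TABLE:
        if raw.startswith(bom):
            return name
    return None


print(sniff_bom('hi'.encode('utf-16')))
print(sniff_bom(b'hello world'))
```

In practice you would pass the first few bytes of a file opened in rb mode. A result of None means no BOM was found, not that the file is ASCII or UTF-8.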
The latin-1 encoding
If the error persists, try using the latin-1 encoding when opening the file.
with open('example.txt', 'r', encoding='latin-1') as f:
    lines = f.readlines()
    # ['ð\x9d\x98\x88á¸\x86ð\x9d\x96¢ð\x9d\x95¯Ù¤á¸\x9eÔ\x8dÐ\x9dÇ\x8f\n', 'hello world']
    print(lines)
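The latin-1 encoding maps every possible byte value to a character, so decoding never raises an error, but the result is only correct if the file was actually saved as latin-1.
Make sure to check if you get legible results when using the latin-1 encoding.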
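The reason latin-1 never fails can be shown directly: each of the 256 possible byte values decodes to exactly one character. A small sketch with illustrative variable names:

```python
all_bytes = bytes(range(256))

# Every byte value maps to one character, so this cannot raise.
decoded = all_bytes.decode('latin-1')

# Encoding the result gives back the original bytes.
round_trip = decoded.encode('latin-1')

print(len(decoded), round_trip == all_bytes)
```

This is why a successful latin-1 decode says nothing about whether the text is correct.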
Here is an example that shows how decoding a bytes object with a different encoding than the one used to encode the string causes the error.
my_text = '𝘈Ḇ𝖢𝕯٤ḞԍНǏ'
my_binary_data = my_text.encode('utf-8')
# ⛔️ UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1: character maps to <undefined>
my_text_again = my_binary_data.decode('cp856')
We can solve the error by using the utf-8 encoding to decode the bytes object.
my_text = '𝘈Ḇ𝖢𝕯٤ḞԍНǏ'
my_binary_data = my_text.encode('utf-8')
# 👉️ b'\xf0\x9d\x98\x88\xe1\xb8\x86\xf0\x9d\x96\xa2\xf0\x9d\x95\xaf\xd9\xa4\xe1\xb8\x9e\xd4\x8d\xd0\x9d\xc7\x8f'
print(my_binary_data)
# ✅ Specify the correct encoding
my_text_again = my_binary_data.decode('utf-8')
print(my_text_again)  # 👉️ '𝘈Ḇ𝖢𝕯٤ḞԍНǏ'