UnicodeDecodeError: 'utf-8' codec can't decode byte in position: invalid continuation byte

Borislav Hadzhiev

Last updated: May 2, 2022

The Python "UnicodeDecodeError: 'utf-8' codec can't decode byte in position: invalid continuation byte" occurs when we decode a bytes object with an encoding other than the one that was used to produce it. To solve the error, specify the correct encoding, e.g. latin-1.

Here is an example of how the error occurs.

main.py
my_bytes = 'one é two'.encode('latin-1')

# ⛔️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 4: invalid continuation byte
my_str = my_bytes.decode('utf-8')

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.

When decoding a bytes object, we have to use the same encoding that was used to encode the string to a bytes object.
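
To make the mismatch concrete, the short sketch below encodes the same string with both codecs and compares the resulting bytes. The variable names are only illustrative.

```python
text = 'one é two'

# The same character maps to different bytes depending on the codec.
latin1_bytes = text.encode('latin-1')
utf8_bytes = text.encode('utf-8')

print(latin1_bytes)  # b'one \xe9 two'
print(utf8_bytes)    # b'one \xc3\xa9 two'

# Decoding with the codec that produced the bytes round-trips cleanly.
print(latin1_bytes.decode('latin-1') == text)  # True
print(utf8_bytes.decode('utf-8') == text)      # True
```

The single byte 0xe9 that latin-1 uses for é is not a complete UTF-8 sequence, which is why decoding the latin-1 bytes as UTF-8 fails.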

In the example, we can set the encoding to latin-1.

main.py
my_bytes = 'one é two'.encode('latin-1')

my_str = my_bytes.decode('latin-1')

print(my_str)  # 👉️ "one é two"

If you got the error when reading from a file using pandas, try setting the encoding to latin-1.

main.py
import pandas as pd

# 👇️ set encoding to latin-1
df = pd.read_csv('employees.csv', sep='|', encoding='latin-1')

print(df)

You can do the same when using the built-in open() function.

main.py
import csv

with open('employees.csv', newline='', encoding='latin-1') as csvfile:
    csv_reader = list(csv.reader(csvfile, delimiter='|'))

    print(csv_reader)
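
If you don't know which encoding a file uses, one possible approach is to try a few candidates in order and keep the first one that decodes without an error. The sketch below is a minimal illustration: `read_text_with_fallback` is a hypothetical helper, the candidate list is an assumption, and the temporary file only exists to make the example self-contained.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=('utf-8', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper; the default candidate list is an assumption.
    """
    with open(path, 'rb') as f:
        raw = f.read()

    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

    raise ValueError(f'none of {encodings} could decode {path!r}')


# Demo: write a throwaway file encoded as latin-1.
with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
    tmp.write('name|city\nRené|Orléans\n'.encode('latin-1'))

text, used = read_text_with_fallback(tmp.name)
os.remove(tmp.name)

print(used)  # latin-1
print(text)
```

Keep in mind that latin-1 can decode any byte sequence, so a successful decode does not prove that the file was actually written as latin-1. Check the output for garbled characters.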

If the error persists, you could set the errors keyword argument to 'ignore' to skip the bytes that cannot be decoded.

Note that ignoring characters that cannot be decoded can lead to data loss.

main.py
import csv

# 👇️ set errors to ignore
with open('employees.csv', newline='', encoding='utf-8', errors='ignore') as csvfile:
    csv_reader = list(csv.reader(csvfile, delimiter='|'))

    print(csv_reader)

Opening the file with an incorrect encoding with errors set to ignore won't raise a UnicodeDecodeError.
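
The data loss is easy to see on a small bytes object. The sketch below decodes the same latin-1 bytes as UTF-8 with two error handlers: 'ignore', and 'replace' as a comparison (the latter is not used elsewhere in this article but is a standard handler).

```python
raw = 'one é two'.encode('latin-1')  # b'one \xe9 two'

# 'ignore' silently drops the byte that UTF-8 cannot decode.
ignored = raw.decode('utf-8', errors='ignore')

# 'replace' substitutes U+FFFD, so the loss stays visible in the text.
replaced = raw.decode('utf-8', errors='replace')

print(ignored == 'one  two')         # True: the é is gone
print(replaced == 'one \ufffd two')  # True: the é became U+FFFD
```

In both cases the original character cannot be recovered from the decoded string.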

Make sure you didn't open the file in rb (binary) mode if you need to read its contents as text.

If you don't need to interact with the contents of the file, you can open it in binary mode without decoding it.

main.py
with open('example.txt', 'rb') as f:
    lines = f.readlines()

    # ✅ [b'\xf0\x9d\x98\x88\xe1\xb8\x86\xf0\x9d\x96\xa2\xf0\x9d\x95\xaf\xd9\xa4\xe1\xb8\x9e\xd4\x8d\xd0\x9d\xc7\x8f\n']
    print(lines)

We opened the file in binary mode (using the rb mode), so the lines list contains bytes objects.

You shouldn't specify the encoding argument when opening a file in binary mode. The open() function raises a ValueError if you do.
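
As a quick check of that behavior, the sketch below passes an encoding together with rb mode and then reads the same file as raw bytes. The temporary file is only there to keep the example self-contained.

```python
import os
import tempfile

# Throwaway file so the example does not depend on existing files.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write('one é two'.encode('utf-8'))

message = None
try:
    # Binary mode returns raw bytes, so open() rejects an encoding argument.
    open(tmp.name, 'rb', encoding='utf-8')
except ValueError as exc:
    message = str(exc)
    print(message)

# Without encoding, read() returns bytes that can be decoded later if needed.
with open(tmp.name, 'rb') as f:
    data = f.read()

os.remove(tmp.name)

print(type(data).__name__)  # bytes
```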

You can use this approach if you need to upload the file to a remote server and don't need to decode it.

Note that if you are trying to read from or write to a PDF file, you have to use the rb or wb modes because PDF is a binary format, not plain text.

main.py
with open('example.pdf', 'rb') as file1:
    my_bytes = file1.read()

    # 👇️ latin-1 maps every byte to a character, so this decode never raises
    print(my_bytes.decode('latin-1'))

Here is an example that shows how using a different encoding to encode a string to bytes than the one used to decode the bytes object causes the error.

main.py
my_text = 'one æåäãé two'

my_binary_data = my_text.encode('latin-1')

# ⛔️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 4: invalid continuation byte
my_text_again = my_binary_data.decode('utf-8')

We can solve the error by using the latin-1 encoding to decode the bytes object.

main.py
my_text = 'one æåäãé two'

my_binary_data = my_text.encode('latin-1')

my_text_again = my_binary_data.decode('latin-1')

print(my_text_again)  # "one æåäãé two"
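
When it isn't obvious which byte caused the failure, the exception object itself records that information. The sketch below catches the error from the failing example and inspects its standard attributes; the `bad_byte` name is only illustrative.

```python
my_binary_data = 'one æåäãé two'.encode('latin-1')

try:
    my_binary_data.decode('utf-8')
except UnicodeDecodeError as exc:
    # The exception records which codec failed, where, and why.
    bad_byte = exc.object[exc.start:exc.end]

    print(exc.encoding)  # utf-8
    print(exc.start)     # 4
    print(bad_byte)      # b'\xe6'
    print(exc.reason)    # invalid continuation byte
```

Looking at the offending byte and its neighbors can give a hint about which encoding the data was actually written in.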

Conclusion

The Python "UnicodeDecodeError: 'utf-8' codec can't decode byte in position: invalid continuation byte" occurs when we decode a bytes object with an encoding other than the one that was used to produce it. To solve the error, specify the correct encoding, e.g. latin-1.
