UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0

Borislav Hadzhiev

Mon May 02, 2022 · 2 min read

The Python "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte" occurs when we specify an incorrect encoding when decoding a bytes object. To solve the error, specify the correct encoding (e.g. utf-16), or open the file in binary mode (rb) without decoding it.

Here is an example of how the error occurs.

main.py
my_bytes = 'hello ÿ'.encode('utf-16')

# ⛔️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
my_str = my_bytes.decode('utf-8')

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.

When decoding a bytes object, we have to use the same encoding that was used to encode the string to a bytes object.

In the example, we can set the encoding to utf-16.

main.py
my_bytes = 'hello ÿ'.encode('utf-16')

my_str = my_bytes.decode('utf-16')
print(my_str)  # 👉️ "hello ÿ"

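The reason byte 0xff shows up in position 0 is that Python's utf-16 codec prepends a byte order mark (BOM) to the encoded data, and 0xff is never a valid start byte in UTF-8. Here is a small sketch illustrating this (the byte order of the BOM depends on your platform's endianness):

```python
data = 'hello'.encode('utf-16')

# the utf-16 codec prepends a byte order mark (BOM);
# on little-endian machines the encoded data starts with b'\xff\xfe'
print(data[:2])

# encoding with an explicit byte order (utf-16-le / utf-16-be) skips the BOM
print('hello'.encode('utf-16-le'))  # no BOM: b'h\x00e\x00l\x00l\x00o\x00'
```

This is why decoding utf-16 data as utf-8 fails immediately at position 0.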
If you got the error when opening a file, you can open the file in binary mode without decoding it.

main.py
with open('example.txt', 'rb') as f:
    data = f.read()
    print(data)

We opened the file in binary mode (using the rb mode), so the data variable contains a bytes object.

You shouldn't specify encoding when opening a file in binary mode.
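In fact, Python enforces this: passing an encoding argument together with a binary mode raises a ValueError before the file is even opened. A quick sketch:

```python
# combining binary mode with an encoding is rejected by open()
try:
    open('example.txt', 'rb', encoding='utf-8')
except ValueError as e:
    print(e)  # binary mode doesn't take an encoding argument
```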

You can use this approach if you need to upload the file to a remote server and don't need to decode it.

If you need to interact with the file, you could set the errors keyword argument to ignore to ignore the characters that cannot be decoded.

Note that ignoring characters that cannot be decoded can lead to data loss.

main.py
# 👇️ set errors to ignore
with open('example.txt', 'r', encoding='utf-16', errors='ignore') as f:
    lines = f.readlines()
    print(lines)

Opening the file with an incorrect encoding with errors set to ignore won't raise a UnicodeDecodeError.
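The errors argument also accepts replace, which substitutes undecodable bytes with the U+FFFD replacement character instead of silently dropping them, so you can at least see where the data was damaged. A small sketch using the raw BOM bytes:

```python
# a utf-16 BOM decoded as utf-8: 0xff is an invalid start byte
data = b'\xff\xfe'

print(data.decode('utf-8', errors='ignore'))   # '' - both bytes are dropped
print(data.decode('utf-8', errors='replace'))  # '��' - two U+FFFD markers
```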

Here is a full example that shows how decoding a bytes object with a different encoding than the one used to encode it causes the error.

main.py
my_text = 'hello ÿ'
my_binary_data = my_text.encode('utf-16')

# ⛔️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
my_text_again = my_binary_data.decode('utf-8')

We can solve the error by using the utf-16 encoding to decode the bytes object.

main.py
my_text = 'hello ÿ'
my_binary_data = my_text.encode('utf-16')

my_text_again = my_binary_data.decode('utf-16')
print(my_text_again)  # 👉️ "hello ÿ"