Remove the non utf-8 characters from a String in Python

Borislav Hadzhiev

Last updated: Aug 14, 2022

To remove the non utf-8 characters from a string:

  1. Use the str.encode() method to encode the string to a bytes object.
  2. Set the errors keyword argument to ignore to drop any non utf-8 characters.
  3. Use the bytes.decode() method to decode the bytes object to a string.

main.py
# ✅ remove non utf-8 characters from string
my_str = 'abc'

result = my_str.encode('utf-8', errors='ignore').decode('utf-8')
print(result)  # 👉️ 'abc'

# ----------------------------------------------------

# ✅ remove non utf-8 characters when reading from file
# errors='ignore' in open() skips invalid bytes while the file is decoded
with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

    for line in lines:
        line = line.encode('utf-8', errors='ignore').decode('utf-8')
        print(line)

# ----------------------------------------------------

# ✅ if you are starting with a bytes object
my_bytes = 'abc'.encode('utf-8')

result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')
print(result)  # 👉️ b'abc'

The first example removes the non utf-8 characters from a string.

The str.encode method returns an encoded version of the string as a bytes object. The default encoding is utf-8.

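When the errors keyword argument is set to ignore, any characters that cannot be encoded using the utf-8 encoding are dropped from the result instead of raising a UnicodeEncodeError.

In practice, the only characters a Python string can hold that utf-8 cannot encode are lone surrogate code points (for example '\udce2'). All other text, including accented letters and emoji, is encoded as is.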
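
A minimal sketch of what "cannot be encoded" means in practice. The sample strings below are made up for illustration; the lone surrogate '\udce2' stands in for the kind of character that ends up in a string when undecodable bytes were read with the surrogateescape error handler.

```python
# A lone surrogate code point is something a Python str can hold
# but the utf-8 codec refuses to encode.
my_str = 'abc\udce2def'

try:
    my_str.encode('utf-8')  # default is errors='strict'
except UnicodeEncodeError as exc:
    print('strict encoding failed:', exc.reason)

# errors='ignore' drops the offending character instead of raising
cleaned = my_str.encode('utf-8', errors='ignore').decode('utf-8')
print(cleaned)

# Ordinary non-ASCII text is valid utf-8 and is left untouched
accented = 'café ñ 日本語'
unchanged = accented.encode('utf-8', errors='ignore').decode('utf-8')
print(unchanged == accented)
```

Note that the round trip only affects characters the codec rejects, so it is not a way to strip non-ASCII characters.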

The next step is to decode the bytes object using the utf-8 encoding.

main.py
my_str = 'abc'

result = my_str.encode('utf-8', errors='ignore').decode('utf-8')
print(result)  # 👉️ 'abc'

The bytes.decode method returns a string decoded from the given bytes. The default encoding is utf-8.

The result is a string that doesn't contain any non-utf-8 characters.

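If you need to remove the non-utf-8 characters when reading from a file, pass errors='ignore' to open() so that invalid bytes are skipped while the file is decoded. Without it, reading a file that contains invalid bytes raises a UnicodeDecodeError before the loop gets a chance to clean anything. You can then use a for loop to iterate over the lines in the file and repeat the same process.

main.py
# errors='ignore' in open() skips invalid bytes while the file is decoded
with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

    for line in lines:
        line = line.encode('utf-8', errors='ignore').decode('utf-8')
        print(line)

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.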
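
For the file case, here is a self-contained sketch that can be run without an existing example.txt. It assumes a throwaway temporary file and made-up sample bytes (two invalid bytes, 0xff and 0xfe, inside otherwise valid text), and it relies on the errors argument of the built-in open() to drop them during decoding.

```python
import os
import tempfile

# Hypothetical sample data: valid text with two invalid bytes mixed in
raw = b'first line\nsecond \xff\xfe line\n'

# Write the raw bytes to a temporary file so the example is reproducible
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # errors='ignore' makes the decoder skip bytes that are not valid utf-8
    with open(path, 'r', encoding='utf-8', errors='ignore') as f:
        lines = [line.rstrip('\n') for line in f]
finally:
    os.remove(path)

print(lines)
```

The invalid bytes are removed as the file is read, so the lines are already clean by the time the loop sees them.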

If you are starting with a bytes object, you have to use the decode() method to decode the bytes object to a string first.

main.py
my_bytes = 'abc'.encode('utf-8')

result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')
print(result)  # 👉️ b'abc'

Make sure to set the errors keyword argument to ignore in the call to the decode() method to drop any non-utf-8 characters when converting to a string.
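
To see the effect on bytes that really are invalid, here is a short sketch. The helper name strip_invalid_utf8 and the sample bytes are hypothetical, chosen only for illustration. It also shows errors='replace', a standard alternative handler that marks the bad bytes with U+FFFD instead of silently dropping them.

```python
def strip_invalid_utf8(data: bytes) -> bytes:
    """Return data with byte sequences that are not valid utf-8 removed."""
    return data.decode('utf-8', errors='ignore').encode('utf-8')


# \xff can never appear in valid utf-8; \xc3\xa9 is the valid encoding of 'é'
my_bytes = b'abc\xffdef\xc3\xa9'

cleaned = strip_invalid_utf8(my_bytes)
print(cleaned)

# errors='replace' keeps a visible marker where the invalid byte was
replaced = my_bytes.decode('utf-8', errors='replace')
print(replaced)
```

Using replace can be the safer choice when you want to notice that data was lost rather than have it disappear without a trace.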
