Remove the non utf-8 characters from a String in Python

Borislav Hadzhiev

Last updated: Feb 20, 2023


# Remove the non utf-8 characters from a String in Python

To remove the non utf-8 characters from a string:

  1. Use the str.encode() method to encode the string to a bytes object.
  2. Set the errors keyword argument to ignore to drop any non utf-8 characters.
  3. Use the bytes.decode() method to decode the bytes object to a string.
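
main.py
```python
# '\x86' is a valid code point that utf-8 can encode, so it would be kept.
# A lone surrogate such as '\ud800' cannot be encoded and gets dropped.
my_str = '\ud800bobbyhadz.com\udfff'

result = my_str.encode(
    'utf-8', errors='ignore'
).decode('utf-8')

print(result)  # 👉️ 'bobbyhadz.com'
```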


If you are starting with a bytes object, use the following code sample instead.

main.py
```python
my_bytes = 'bobbyhadz.com'.encode('utf-8')

result = my_bytes.decode(
    'utf-8', errors='ignore'
).encode('utf-8')

print(result)  # 👉️ b'bobbyhadz.com'
```


The example removes the non utf-8 characters from a string.

The str.encode method returns an encoded version of the string as a bytes object. The default encoding is utf-8.

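When the errors keyword argument is set to ignore, characters that cannot be encoded are dropped instead of raising a UnicodeEncodeError.

For a Python string and the utf-8 encoding, the only such characters are lone surrogate code points, e.g. '\ud800'. Every other character, including accented letters and emoji, can be encoded and is kept.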
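
To make the effect of the errors argument concrete, here is a minimal sketch that compares the strict, ignore and replace handlers. The sample string and the variable names are made up for illustration.

```python
# 'é' is encodable in utf-8; the lone surrogate '\ud800' is not.
text = 'caf\u00e9 \ud800ok'

# 'strict' (the default) raises on the first unencodable character.
try:
    text.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' drops the character, 'replace' substitutes a question mark.
ignored = text.encode('utf-8', errors='ignore').decode('utf-8')
replaced = text.encode('utf-8', errors='replace').decode('utf-8')

print(strict_failed)  # True
print(ignored)        # café ok
print(replaced)       # café ?ok
```

Note that the accented letter survives in every case, because only the surrogate is unencodable.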

The next step is to decode the bytes object using the utf-8 encoding.

main.py
```python
my_str = 'abc'

result = my_str.encode('utf-8', errors='ignore').decode('utf-8')

print(result)  # 👉️ 'abc'
```

The bytes.decode method returns a string decoded from the given bytes. The default encoding is utf-8.

The result is a string that doesn't contain any non utf-8 characters.
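
If you need the encode/decode round trip in more than one place, you could wrap it in a small helper. This is only a sketch, and the function name strip_unencodable is hypothetical, not something Python provides.

```python
def strip_unencodable(text: str) -> str:
    """Return text without the characters that utf-8 cannot encode."""
    return text.encode('utf-8', errors='ignore').decode('utf-8')


cleaned = strip_unencodable('\udcffbobbyhadz.com\ud800')
print(cleaned)  # bobbyhadz.com
```

Strings that contain only encodable characters are returned unchanged.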

# Remove the non utf-8 characters from a File

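If you need to remove the non-utf-8 characters when reading from a file, open the file with the errors argument set to surrogateescape so that invalid bytes don't raise a UnicodeDecodeError. Then use a for loop to iterate over the lines in the file and repeat the same process.

main.py
```python
# surrogateescape maps undecodable bytes to lone surrogates
# instead of raising UnicodeDecodeError while reading
with open(
    'example.txt', 'r', encoding='utf-8', errors='surrogateescape'
) as f:
    lines = f.readlines()

    for line in lines:
        # lone surrogates cannot be encoded, so they are dropped here
        line = line.encode(
            'utf-8', errors='ignore'
        ).decode('utf-8')

        print(line)
```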


The code sample assumes that you have an example.txt file located in the same directory as your Python script.
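
An alternative you might consider is to open the file in binary mode and drop the invalid bytes while decoding each line. The sketch below writes its own temporary sample file, so it does not depend on example.txt. The file contents and variable names are assumptions made for illustration.

```python
import os
import tempfile

# Create a sample file that contains two invalid utf-8 bytes (0x86).
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(b'first \x86line\nsecond line\x86\n')
    path = tmp.name

# Read raw bytes and ignore anything that is not valid utf-8.
with open(path, 'rb') as f:
    cleaned_lines = [
        raw.decode('utf-8', errors='ignore') for raw in f
    ]

os.remove(path)  # clean up the temporary file

print(cleaned_lines)  # ['first line\n', 'second line\n']
```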

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.

# Remove the non utf-8 characters when starting with a Bytes object

If you are starting with a bytes object, you have to use the decode() method to decode the bytes object to a string first.

main.py
```python
my_bytes = 'bobbyhadz.com'.encode('utf-8')

result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')

print(result)  # 👉️ b'bobbyhadz.com'
```

Make sure to set the errors keyword argument to ignore in the call to the decode() method to drop any non-utf-8 characters when converting to a string.
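
To see the errors argument actually drop something, here is a short sketch with a bytes object that contains invalid utf-8 bytes. The sample value is made up for illustration.

```python
# 0x86 is a continuation byte, so it is invalid on its own in utf-8.
raw = b'\x86bobbyhadz.com\x86'

text = raw.decode('utf-8', errors='ignore')
print(text)  # bobbyhadz.com

# Re-encode if the result needs to stay a bytes object.
clean_bytes = text.encode('utf-8')
print(clean_bytes)  # b'bobbyhadz.com'

# 'replace' keeps a visible marker (U+FFFD) instead of dropping bytes.
marked = raw.decode('utf-8', errors='replace')
print(marked == '\ufffdbobbyhadz.com\ufffd')  # True
```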


Copyright © 2024 Borislav Hadzhiev