Remove the non utf-8 characters from a String in Python

# Remove the non utf-8 characters from a String in Python

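To remove the characters that cannot be encoded as utf-8 from a string:

1. Use the str.encode() method to encode the string to a bytes object.
2. Set the errors keyword argument to ignore to drop any characters that cannot be encoded.
3. Use the bytes.decode() method to decode the bytes object back to a string.

In Python 3, every string is Unicode, so the only characters that cannot be encoded as utf-8 are lone surrogates such as '\udc86'.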

main.py

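# '\udc86' is a lone surrogate, which cannot be encoded as utf-8
# (a character like '\x86' is valid Unicode and would be kept)
my_str = '\udc86bobbyhadz.com\udc86'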

result = my_str.encode(
    'utf-8', errors='ignore'
).decode('utf-8')

print(result)  # 👉️ 'bobbyhadz.com'

The code for this article is available on GitHub

If you are starting with a bytes object, use the following code sample instead.

main.py

# the byte 0x86 on its own is not valid utf-8
my_bytes = b'\x86bobbyhadz.com\x86'

result = my_bytes.decode(
  'utf-8', errors='ignore'
).encode('utf-8')

print(result)  # 👉️ b'bobbyhadz.com'

The first example removes the characters that cannot be encoded as utf-8 from a string.

The str.encode() method returns an encoded version of the string as a bytes object. The default encoding is utf-8.

When the errors keyword argument is set to ignore, any characters that cannot be encoded using the utf-8 encoding are dropped from the result. With the default value, strict, they raise a UnicodeEncodeError instead.
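To make the effect of the errors argument concrete, here is a minimal sketch that encodes the same string with three standard error handlers. The sample string is made up for illustration.

```python
# 'é' can be encoded as utf-8; '\udc86' is a lone surrogate and cannot
text = 'caf\u00e9 \udc86'

# The default handler, errors='strict', raises on the surrogate
try:
    text.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode('utf-8', errors='ignore')    # drops the surrogate
replaced = text.encode('utf-8', errors='replace')  # substitutes '?'

print(strict_failed)  # 👉️ True
print(ignored)        # 👉️ b'caf\xc3\xa9 '
print(replaced)       # 👉️ b'caf\xc3\xa9 ?'
```

Using replace instead of ignore can be useful when you want to see where a character was lost.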

The next step is to decode the bytes object using the utf-8 encoding.

main.py

my_str = 'abc'

result = my_str.encode('utf-8', errors='ignore').decode('utf-8')
print(result)  # 👉️ 'abc'

The bytes.decode() method returns a string decoded from the given bytes. The default encoding is utf-8.

The result is a string that doesn't contain any non utf-8 characters.
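If you need the encode and decode round trip in several places, it may be convenient to wrap it in a small helper. The function name remove_unencodable below is hypothetical, chosen only for this sketch.

```python
def remove_unencodable(text: str) -> str:
    """Return text without the characters that utf-8 cannot encode."""
    return text.encode('utf-8', errors='ignore').decode('utf-8')


# Lone surrogates are dropped
print(remove_unencodable('\udc86bobbyhadz.com\udc86'))  # 👉️ bobbyhadz.com

# Ordinary non-ASCII characters are kept as they are
print(remove_unencodable('h\u00e9llo'))  # 👉️ héllo
```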

# Remove the non utf-8 characters from a File

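If you need to remove the non-utf-8 characters when reading from a file, set the errors argument to ignore in the call to open() so that invalid bytes are dropped while the file is decoded, then use a for loop to iterate over the lines in the file and repeat the same process.

main.py

# without errors='ignore', reading invalid bytes raises UnicodeDecodeError
with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f: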
    lines = f.readlines()

    for line in lines:
        line = line.encode(
          'utf-8', errors='ignore'
        ).decode('utf-8')

        print(line)

The code sample assumes that you have an example.txt file located in the same directory as your Python script.
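If you don't have such a file at hand, the following self-contained sketch creates a temporary one that contains an invalid byte and then reads it with errors set to ignore in the call to open(). The file contents and variable names are made up for the demonstration.

```python
import os
import tempfile

# Create a small file containing an invalid utf-8 byte (0x86)
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'first \x86line\nsecond line\n')
    path = tmp.name

# errors='ignore' drops the undecodable byte while the file is read
with open(path, 'r', encoding='utf-8', errors='ignore') as f:
    cleaned_lines = [line.rstrip('\n') for line in f]

os.remove(path)  # clean up the temporary file

print(cleaned_lines)  # 👉️ ['first line', 'second line']
```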

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.
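As a quick illustration of the two directions, this short sketch encodes a string that contains one non-ASCII character and decodes it back. The sample word is arbitrary.

```python
text = 'na\u00efve'  # 'naïve'

data = text.encode('utf-8')  # str -> bytes
print(data)                  # 👉️ b'na\xc3\xafve'
print(type(data).__name__)   # 👉️ bytes

restored = data.decode('utf-8')  # bytes -> str
print(restored == text)          # 👉️ True

# 'ï' takes two bytes in utf-8, so the lengths differ
print(len(text), len(data))  # 👉️ 5 6
```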

# Remove the non utf-8 characters when starting with a Bytes object

If you are starting with a bytes object, you have to use the decode() method to decode the bytes object to a string first.

main.py

# the byte 0x86 on its own is not valid utf-8
my_bytes = b'\x86bobbyhadz.com\x86'

result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')

print(result)  # 👉️ b'bobbyhadz.com'

Make sure to set the errors keyword argument to ignore in the call to the decode() method to drop any non-utf-8 characters when converting to a string.
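The sketch below compares what bytes.decode() does with the same invalid input under three standard error handlers. The sample bytes are made up for illustration.

```python
raw = b'\x86bobbyhadz.com\x86'  # 0x86 on its own is not valid utf-8

# The default handler, errors='strict', raises on the invalid byte
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode('utf-8', errors='ignore')    # drops the invalid bytes
replaced = raw.decode('utf-8', errors='replace')  # inserts U+FFFD instead

print(strict_failed)   # 👉️ True
print(ignored)         # 👉️ bobbyhadz.com
print(ascii(replaced)) # 👉️ '\ufffdbobbyhadz.com\ufffd'
```

The replace handler keeps a visible marker (the U+FFFD replacement character) where each invalid byte was, which can help when you need to find out where the data was damaged.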
