Last updated: Apr 9, 2024
Reading time · 2 min

To remove the non-UTF-8 characters from a string:

1. Use the str.encode() method to encode the string to a bytes object.
2. Set the errors keyword argument to ignore to drop any characters that cannot be encoded as UTF-8.
3. Use the bytes.decode() method to decode the bytes object to a string.

    # '\udc86' is a lone surrogate, a code point that UTF-8 cannot encode
    my_str = '\udc86bobbyhadz.com\udc86'

    result = my_str.encode('utf-8', errors='ignore').decode('utf-8')

    print(result)  # 'bobbyhadz.com'

If you are starting with a bytes object, use the following code sample instead.
    my_bytes = 'bobbyhadz.com'.encode('utf-8')

    result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')

    print(result)  # b'bobbyhadz.com'

The example removes the non utf-8 characters from a string.
The str.encode() method returns an
encoded version of the string as a bytes object. The default encoding is
utf-8.
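When the errors keyword argument is set to ignore, characters that cannot be encoded are dropped from the string. In a Python 3 str, these are the lone surrogate code points (U+D800 to U+DFFF), which typically appear when undecodable bytes were read with the surrogateescape error handler.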
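As a rough illustration of how the errors argument changes the behavior of str.encode(), the sketch below encodes the same string with three handlers. The sample string and variable names are assumptions made for the example; the lone surrogate '\udc86' stands in for a character that UTF-8 cannot encode.

```python
# Hypothetical sample: '\udc86' is a lone surrogate, which UTF-8 cannot encode.
sample = 'a\udc86b'

# errors='strict' (the default) raises an exception.
try:
    sample.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors='ignore' silently drops the offending character.
ignored = sample.encode('utf-8', errors='ignore')

# errors='replace' substitutes a question mark instead.
replaced = sample.encode('utf-8', errors='replace')

print(strict_failed)  # True
print(ignored)        # b'ab'
print(replaced)       # b'a?b'
```

Choosing ignore loses data without any trace, so replace may be the safer option when you need to see where characters were removed.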
The next step is to decode the bytes object using the utf-8 encoding.
    my_str = 'abc'

    result = my_str.encode('utf-8', errors='ignore').decode('utf-8')

    print(result)  # 'abc'
The bytes.decode() method returns a
string decoded from the given bytes. The default encoding is utf-8.
The result is a string that doesn't contain any non utf-8 characters.
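If you need the encode and decode steps in more than one place, they could be wrapped in a small function. The helper name remove_non_utf8 is hypothetical and only used for this sketch; it also shows that valid non-ASCII characters such as é are left untouched.

```python
def remove_non_utf8(text: str) -> str:
    """Drop characters that cannot be encoded as UTF-8 (hypothetical helper)."""
    return text.encode('utf-8', errors='ignore').decode('utf-8')


# 'é' is valid UTF-8 and is kept; the lone surrogate '\udc86' is dropped.
print(remove_non_utf8('café \udc86ok'))  # 'café ok'
print(remove_non_utf8('abc'))            # 'abc'
```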
If you need to remove the non-utf-8 characters when reading from a file, use a
for loop to iterate over the lines in the file and repeat the same process.
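    # errors='surrogateescape' keeps undecodable bytes as lone surrogates
    # instead of raising UnicodeDecodeError while reading
    with open('example.txt', 'r', encoding='utf-8', errors='surrogateescape') as f:
        lines = f.readlines()

        for line in lines:
            line = line.encode('utf-8', errors='ignore').decode('utf-8')
            print(line)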

The code sample assumes that you have an example.txt file located in the same
directory as your Python script.
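A possible alternative, sketched under the assumption that you only want the invalid bytes gone, is to pass errors='ignore' directly to open() so they are dropped while the file is read. The temporary file and its contents are made up for the demo so that the snippet runs on its own.

```python
import os
import tempfile

# Create a throwaway file containing one invalid byte (0x86) for the demo.
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(b'first \x86line\nsecond line\n')
    path = tmp.name

cleaned = []

# errors='ignore' makes the file object drop undecodable bytes while reading.
with open(path, 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        cleaned.append(line.rstrip('\n'))

os.remove(path)

print(cleaned)  # ['first line', 'second line']
```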
Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.

If you are starting with a bytes object, you have to use the decode() method to decode the bytes object to a string first.
    my_bytes = 'bobbyhadz.com'.encode('utf-8')

    result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')

    print(result)  # b'bobbyhadz.com'
Make sure to set the errors keyword argument to ignore in the call to the
decode() method to drop any non-utf-8 characters when converting to a string.
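To see the effect on bytes that really are invalid, here is a minimal sketch with a made-up input in which the byte 0x86 appears on its own, which is not a valid UTF-8 sequence. The variable names are assumptions for illustration.

```python
# Hypothetical input: 0x86 on its own is not a valid UTF-8 sequence.
raw = b'\x86bobbyhadz.com\x86'

# errors='ignore' drops the invalid bytes.
text = raw.decode('utf-8', errors='ignore')

# errors='replace' keeps a visible marker (U+FFFD) in their place.
marked = raw.decode('utf-8', errors='replace')

print(text)          # 'bobbyhadz.com'
print(repr(marked))  # '\ufffdbobbyhadz.com\ufffd'
```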