Last updated: Apr 9, 2024
Reading time · 2 min
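To remove the non-UTF-8 characters from a string:
1. Use the str.encode() method to encode the string to a bytes object.
2. Set the errors keyword argument to ignore to drop any characters that cannot be encoded as UTF-8.
3. Use the bytes.decode() method to decode the bytes object to a string.

    # '\udc86' is a lone surrogate, which has no valid UTF-8 encoding.
    # (A character such as '\x86' is U+0086, which UTF-8 encodes fine, so it would be kept.)
    my_str = '\udc86bobbyhadz.com\udc86'

    result = my_str.encode(
        'utf-8', errors='ignore'
    ).decode('utf-8')

    print(result)  # 👉️ 'bobbyhadz.com'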
If you are starting with a bytes object, use the following code sample instead.
    my_bytes = 'bobbyhadz.com'.encode('utf-8')

    result = my_bytes.decode(
        'utf-8', errors='ignore'
    ).encode('utf-8')

    print(result)  # 👉️ b'bobbyhadz.com'
The example removes the non-UTF-8 characters from a string.
The str.encode() method returns an encoded version of the string as a bytes object. The default encoding is utf-8.
When the errors keyword argument is set to ignore, any characters that cannot be encoded using the utf-8 encoding are dropped from the string. In Python 3, the only characters in a str that utf-8 cannot encode are lone surrogate code points (U+D800 to U+DFFF), which typically appear when undecodable bytes were decoded with the surrogateescape error handler.
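To make concrete which characters the utf-8 codec refuses, here is a minimal sketch. The sample strings are illustrative assumptions and are not part of the original example. It contrasts a lone surrogate, which cannot be encoded, with an ordinary non-ASCII character, which can.

```python
# A lone surrogate (here U+DC86) has no valid UTF-8 encoding.
with_surrogate = 'abc\udc86'

# A regular non-ASCII character such as 'é' encodes without problems.
with_accent = 'café'

try:
    with_surrogate.encode('utf-8')  # default is errors='strict'
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

dropped = with_surrogate.encode('utf-8', errors='ignore')
kept = with_accent.encode('utf-8', errors='ignore')

print(strict_failed)  # True
print(dropped)        # b'abc'
print(kept)           # b'caf\xc3\xa9'
```

With the default errors='strict', the same call raises a UnicodeEncodeError instead of silently dropping the character, which is often the safer choice when data loss would go unnoticed.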
The next step is to decode the bytes object using the utf-8 encoding.
    my_str = 'abc'

    result = my_str.encode('utf-8', errors='ignore').decode('utf-8')

    print(result)  # 👉️ 'abc'
The bytes.decode() method returns a string decoded from the given bytes. The default encoding is utf-8.
The result is a string that doesn't contain any non-UTF-8 characters.
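If you need the encode and decode round trip in several places, it can be wrapped in a small helper. This is only a sketch: the function name remove_non_utf8 is a hypothetical choice, not something the standard library provides.

```python
def remove_non_utf8(text: str) -> str:
    """Return text without characters that the utf-8 codec cannot encode."""
    # remove_non_utf8 is a hypothetical helper name used for illustration.
    return text.encode('utf-8', errors='ignore').decode('utf-8')


plain = remove_non_utf8('abc')
cleaned = remove_non_utf8('\udc86abc\udc86')  # lone surrogates are dropped
accented = remove_non_utf8('naïve')           # valid characters are kept

print(plain)     # abc
print(cleaned)   # abc
print(accented)  # naïve
```

Strings that contain only encodable characters pass through unchanged, so the helper is safe to call on input that is already clean.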
If you need to remove the non-UTF-8 characters when reading from a file, use a for loop to iterate over the lines in the file and repeat the same process.
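    # errors='ignore' drops bytes that are not valid UTF-8 while reading;
    # without it, open() raises a UnicodeDecodeError on such bytes.
    with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
        lines = f.readlines()

        for line in lines:
            # same encode/decode step as in the previous examples
            line = line.encode(
                'utf-8', errors='ignore'
            ).decode('utf-8')

            print(line)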
The code sample assumes that you have an example.txt file located in the same directory as your Python script.
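An alternative, sketched here under the assumption that the file may contain bytes that are not valid UTF-8, is to open the file in binary mode and decode each line yourself. The temporary file and its contents are made up for the demonstration so that the snippet runs on its own.

```python
import os
import tempfile

# Create a throwaway file containing one invalid byte (0x86) for the demo.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'first \x86line\nsecond line\n')
    path = tmp.name

cleaned_lines = []
with open(path, 'rb') as f:  # binary mode: no decoding happens yet
    for raw_line in f:
        # Invalid byte sequences are dropped while decoding.
        cleaned_lines.append(raw_line.decode('utf-8', errors='ignore'))

os.remove(path)
print(cleaned_lines)  # ['first line\n', 'second line\n']
```

Reading in binary mode keeps the decoding step explicit, which makes it easy to switch to a different error handler later.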
Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.
If you are starting with a bytes object, you have to use the decode() method to decode the bytes object to a string first.
    my_bytes = 'bobbyhadz.com'.encode('utf-8')

    result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')

    print(result)  # 👉️ b'bobbyhadz.com'
Make sure to set the errors keyword argument to ignore in the call to the decode() method to drop any non-UTF-8 bytes when converting to a string.
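To see the errors argument make a difference, the input has to contain bytes that are not valid UTF-8. The sketch below uses a made-up bytes value for that purpose and also shows the replace handler for comparison.

```python
# 0x86 on its own is a stray continuation byte, so it is not valid UTF-8.
raw = b'\x86bobbyhadz.com\x86'

ignored = raw.decode('utf-8', errors='ignore')    # invalid bytes are dropped
replaced = raw.decode('utf-8', errors='replace')  # each invalid byte becomes U+FFFD

print(ignored)                  # bobbyhadz.com
print(ignored.encode('utf-8'))  # b'bobbyhadz.com'
print(len(replaced))            # 15
```

Using replace instead of ignore keeps a visible marker (the U+FFFD replacement character) wherever data was lost, which can help when debugging the source of the bad bytes.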