Remove non-ASCII characters from a string in Python

Borislav Hadzhiev

Last updated: Jul 10, 2022


To remove the non-ASCII characters from a string:

  1. Check if each character in the string has a Unicode code point less than 128. The first 128 Unicode code points represent the ASCII characters.
  2. Use the join() method to join the matching characters into a string.
main.py
# 👇️ check if each character in string is ASCII
def remove_non_ascii(string):
    return ''.join(char for char in string if ord(char) < 128)
print(remove_non_ascii('a€bñcá'))  # 👉️ abc
print(remove_non_ascii('a_b^0'))  # 👉️ a_b^0
# ----------------------------------------------------
# 👇️ encode using ASCII encoding and drop all non-ASCII characters
def remove_non_ascii_2(string):
    return string.encode('ascii', errors='ignore').decode()
print(remove_non_ascii_2('a€bñcá'))  # 👉️ abc
print(remove_non_ascii_2('a_b^0'))  # 👉️ a_b^0

The first function uses a generator expression to iterate over the string.

Generator expressions are used to perform some operation for every element, or select a subset of elements that meet a condition.
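As a small sketch of both uses, the snippet below filters a string down to a subset of its characters and then applies an operation to every character. The variable names (text, ascii_only, code_points) are made up for illustration.

```python
# Sample string taken from the examples in this article.
text = 'a€bñcá'

# Select a subset: keep only characters whose code point is below 128.
ascii_only = ''.join(char for char in text if ord(char) < 128)
print(ascii_only)  # abc

# Perform an operation for every element: map each character to its code point.
code_points = list(ord(char) for char in text)
print(code_points)  # [97, 8364, 98, 241, 99, 225]
```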
main.py
def remove_non_ascii(string):
    return ''.join(char for char in string if ord(char) < 128)
print(remove_non_ascii('a€bñcá'))  # 👉️ abc
print(remove_non_ascii('a_b^0'))  # 👉️ a_b^0

On each iteration, we check if the Unicode code point of the character is less than 128.

The first 128 Unicode code points represent the ASCII characters.

The ord function takes a string that represents 1 Unicode character and returns an integer representing the Unicode code point of the given character.
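Here is a short illustration of how ord() behaves for an ASCII and a non-ASCII character, and what happens when it is given more than one character. The characters used are arbitrary examples.

```python
# An ASCII character has a code point below 128.
print(ord('a'))        # 97
print(ord('a') < 128)  # True

# A non-ASCII character has a code point of 128 or higher.
print(ord('€'))        # 8364
print(ord('€') < 128)  # False

# ord() expects exactly one character; a longer string raises TypeError.
try:
    ord('ab')
except TypeError as error:
    print(type(error).__name__)  # TypeError
```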

The last step is to join the characters that satisfy the condition.

The str.join method takes an iterable as an argument and returns a string which is the concatenation of the strings in the iterable.

The string the method is called on is used as the separator between the elements. We call the method on an empty string, so the characters are joined without a separator.
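To make the role of the separator concrete, here is a minimal sketch that joins the same list with different separators. The chars list is a made-up example.

```python
chars = ['a', 'b', 'c']

# An empty string as the separator joins the elements directly.
print(''.join(chars))    # abc

# Any other string is placed between the elements.
print('-'.join(chars))   # a-b-c
print(', '.join(chars))  # a, b, c
```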

Alternatively, you can use the str.encode() and bytes.decode() methods.

To remove the non-ASCII characters from a string using encoding:

  1. Use the str.encode() method to encode the string using the ASCII encoding.
  2. Set the errors argument to ignore, so all non-ASCII characters are dropped.
  3. Use the bytes.decode() method to convert the bytes object to a string.
main.py
def remove_non_ascii_2(string):
    return string.encode('ascii', errors='ignore').decode()
print(remove_non_ascii_2('a€bñcá'))  # 👉️ abc
print(remove_non_ascii_2('a_b^0'))  # 👉️ a_b^0
print(ord('a'))  # 👉️ 97
print(ord('b'))  # 👉️ 98

The str.encode method returns an encoded version of the string as a bytes object. The default encoding is utf-8.

We set the encoding to ascii and the errors argument to ignore.

When the errors keyword argument is set to ignore, characters that cannot be encoded are dropped.

With the ascii encoding, this means that every character with a code point of 128 or higher is dropped from the result.
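For comparison, the sketch below encodes the same string with three of the standard error handlers: ignore, replace and the default strict. The replace handler is not used in this article and is shown only to contrast with ignore.

```python
text = 'a€bñcá'

# 'ignore' silently drops characters that cannot be encoded.
print(text.encode('ascii', errors='ignore'))   # b'abc'

# 'replace' substitutes a question mark for each such character.
print(text.encode('ascii', errors='replace'))  # b'a?b?c?'

# The default, 'strict', raises an exception instead.
try:
    text.encode('ascii')
except UnicodeEncodeError as error:
    print(type(error).__name__)  # UnicodeEncodeError
```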

The last step is to use the bytes.decode() method to decode the bytes object to a string.

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.

The bytes.decode method returns a string decoded from the given bytes. The default encoding is utf-8. ASCII-encoded bytes are also valid UTF-8, so calling decode() without arguments works here.
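The following sketch walks through the encode and decode steps separately, so the intermediate bytes object is visible. The variable names are illustrative only.

```python
text = 'a€b'

# Encoding with the default utf-8 keeps every character (as one or more bytes).
utf8_bytes = text.encode()
print(utf8_bytes)           # b'a\xe2\x82\xacb'
print(utf8_bytes.decode())  # a€b

# Encoding with ascii and errors='ignore' drops the euro sign.
ascii_bytes = text.encode('ascii', errors='ignore')
print(type(ascii_bytes).__name__)  # bytes
print(ascii_bytes)                 # b'ab'

# Decoding converts the bytes object back to a string.
print(ascii_bytes.decode())        # ab
```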

The result is a string that doesn't contain any non-ASCII characters.
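As a quick sanity check, the sketch below runs both functions from this article over a few sample strings and confirms that they agree and that the result contains only ASCII characters. It assumes Python 3.7 or newer, where str.isascii() is available. The samples list is made up for illustration.

```python
def remove_non_ascii(string):
    return ''.join(char for char in string if ord(char) < 128)


def remove_non_ascii_2(string):
    return string.encode('ascii', errors='ignore').decode()


# Hypothetical sample inputs, including an empty string.
samples = ['a€bñcá', 'a_b^0', '', 'naïve café']

for sample in samples:
    first = remove_non_ascii(sample)
    second = remove_non_ascii_2(sample)

    # Both approaches should produce the same ASCII-only string.
    assert first == second
    assert first.isascii()  # str.isascii() requires Python 3.7+

    print(repr(sample), '->', repr(first))
```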
