Remove non-ASCII characters from a string in Python

Borislav Hadzhiev

Last updated: Apr 9, 2024

# Table of Contents

  1. Remove non-ASCII characters from a string in Python
  2. Remove non-ASCII characters from a string using ord()
  3. Remove non-ASCII characters from a string using encode() and decode()

# Remove non-ASCII characters from a string in Python

To remove the non-ASCII characters from a string:

  1. Use the string.printable attribute to get a string of the printable ASCII characters.
  2. Use the filter() function to remove the non-ASCII characters.
  3. Use the join() method to join the result into a string.
main.py

    import string

    def remove_non_ascii(a_str):
        ascii_chars = set(string.printable)

        return ''.join(
            filter(lambda x: x in ascii_chars, a_str)
        )

    print(remove_non_ascii('a€bñcá'))  # 👉️ 'abc'
    print(remove_non_ascii('a_b^0'))  # 👉️ a_b^0


The code for this article is available on GitHub

The string.printable attribute is a string of the ASCII characters that are considered printable. It is a constant, not a method, so it is accessed without parentheses.

main.py

    import string

    # 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP
    # QRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    # (followed by the whitespace characters)
    print(string.printable)

The string is a combination of digits, ASCII letters, punctuation and whitespace.
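As a quick check of that composition, the sketch below compares string.printable against the four constants from the same module. The variable name `combined` is only for illustration.

```python
import string

# string.printable is the concatenation of four other constants
# from the string module, in this order.
combined = (
    string.digits
    + string.ascii_letters
    + string.punctuation
    + string.whitespace
)

print(combined == string.printable)  # True
print(len(string.printable))         # 100
```

Note that string.printable covers 100 of the 128 ASCII characters. The remaining ASCII control characters are not part of it.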

The filter function takes a function and an iterable as arguments and constructs an iterator from the elements of the iterable for which the function returns a truthy value.

The lambda function we passed to filter() gets called with each character in the string and filters out the non-ASCII characters.

main.py

    import string

    def remove_non_ascii(a_str):
        ascii_chars = set(string.printable)

        return ''.join(
            filter(lambda x: x in ascii_chars, a_str)
        )
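To make the role of filter() more concrete, here is a minimal sketch that runs the filtering step on its own. The names `kept`, `kept_list` and `same` are illustrative and not part of the original example.

```python
import string

ascii_chars = set(string.printable)

# filter() returns a lazy iterator, not a list or a string.
kept = filter(lambda ch: ch in ascii_chars, 'a€bñcá')

# Consuming the iterator yields only the characters that passed the test.
kept_list = list(kept)
print(kept_list)  # ['a', 'b', 'c']

# A list comprehension expresses the same selection without a lambda.
same = [ch for ch in 'a€bñcá' if ch in ascii_chars]
print(same == kept_list)  # True
```

Because filter() yields individual characters, the result still has to be passed to ''.join() to get a string back.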

Alternatively, you can use the ord() function.

# Remove non-ASCII characters from a string using ord()

This is a three-step process:

  1. Use a generator expression to iterate over the string.
  2. Check if each character has a Unicode code point of less than 128 (the first 128 Unicode code points represent the ASCII characters).
  3. Use the join() method to join the matching characters into a string.
main.py

    # 👇️ check if each character in a string is ASCII
    def remove_non_ascii(string):
        return ''.join(
            char for char in string
            if ord(char) < 128
        )

    print(remove_non_ascii('a€bñcá'))  # 👉️ 'abc'
    print(remove_non_ascii('a_b^0'))  # 👉️ a_b^0


The function uses a generator expression to iterate over the string.

Generator expressions are used to perform some operation for every element or select a subset of elements that meet a condition.
main.py

    def remove_non_ascii(string):
        return ''.join(char for char in string if ord(char) < 128)

    print(remove_non_ascii('a€bñcá'))  # 👉️ 'abc'
    print(remove_non_ascii('a_b^0'))  # 👉️ a_b^0

On each iteration, we check if the Unicode code point of the character is less than 128.

The first 128 Unicode code points represent the ASCII characters.

The ord function takes a string that represents 1 Unicode character and returns an integer representing the Unicode code point of the given character.
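The short sketch below shows what ord() returns for an ASCII and two non-ASCII characters, and how the result compares to 128. The names `sample` and `code_points` are illustrative. The str.isascii() call at the end assumes Python 3.7 or newer.

```python
sample = 'a€ñ'

# Map each character to its Unicode code point.
code_points = {char: ord(char) for char in sample}

for char, code_point in code_points.items():
    print(char, code_point, code_point < 128)

# a 97 True
# € 8364 False
# ñ 241 False

# chr() is the inverse of ord().
print(chr(97))  # a

# On Python 3.7+, str.isascii() performs the same range check.
print('a'.isascii(), 'ñ'.isascii())  # True False
```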

The last step is to join the characters that satisfy the condition.

The str.join method takes an iterable as an argument and returns a string which is the concatenation of the strings in the iterable.

The string the method is called on is used as the separator between the elements.
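A small sketch of the separator behavior, using illustrative variable names. Calling join() on an empty string concatenates the items with nothing in between, which is what the function above relies on.

```python
chars = ['a', 'b', 'c']

# An empty separator simply concatenates the items.
empty_sep = ''.join(chars)
print(empty_sep)  # abc

# Any other string is placed between the items.
dash_sep = '-'.join(chars)
print(dash_sep)  # a-b-c
```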

Alternatively, you can use the str.encode() and bytes.decode() methods.

# Remove non-ASCII characters from a string using encode() and decode()

This is a three-step process:

  1. Use the str.encode() method to encode the string using the ASCII encoding.
  2. Set the errors argument to ignore, so all non-ASCII characters are dropped.
  3. Use the bytes.decode() method to convert the bytes object to a string.
main.py

    def remove_non_ascii(string):
        return string.encode('ascii', errors='ignore').decode()

    print(remove_non_ascii('a€bñcá'))  # 👉️ 'abc'
    print(remove_non_ascii('a_b^0'))  # 👉️ a_b^0

    print(ord('a'))  # 👉️ 97
    print(ord('b'))  # 👉️ 98


The str.encode() method returns an encoded version of the string as a bytes object. The default encoding is utf-8.

We set the encoding to ascii and the errors argument to ignore.

When the errors keyword argument is set to 'ignore', characters that cannot be encoded are dropped instead of raising a UnicodeEncodeError. With the ASCII encoding, this means every non-ASCII character is removed from the result.
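To see what the errors argument changes, the sketch below encodes the same string with three standard error handlers. The variable names are illustrative. 'strict' is the default handler, and 'replace' is shown only for comparison.

```python
text = 'a€bñcá'

# The default handler, 'strict', raises instead of dropping characters.
try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    print(type(err).__name__)  # UnicodeEncodeError

# 'ignore' silently drops every character that cannot be encoded.
ignored = text.encode('ascii', errors='ignore')
print(ignored)  # b'abc'

# 'replace' substitutes a question mark for each such character.
replaced = text.encode('ascii', errors='replace')
print(replaced)  # b'a?b?c?'
```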

The last step is to use the bytes.decode() method to decode the bytes object to a string.

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.

The bytes.decode() method returns a string decoded from the given bytes. The default encoding is utf-8.
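The round trip between str and bytes can be sketched as follows, with illustrative variable names. The last lines show why calling decode() without arguments is safe here: bytes produced by the ASCII encoding decode to the same string under UTF-8.

```python
text = 'ñ'

# str -> bytes, using the default utf-8 encoding.
encoded = text.encode()
print(encoded)                   # b'\xc3\xb1'
print(len(text), len(encoded))   # 1 2

# bytes -> str, again using utf-8 by default.
decoded = encoded.decode()
print(decoded == text)           # True

# Pure ASCII bytes decode identically as ASCII and as UTF-8.
ascii_bytes = 'a€b'.encode('ascii', errors='ignore')
print(ascii_bytes.decode() == ascii_bytes.decode('ascii'))  # True
```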

The result is a string that doesn't contain any non-ASCII characters.
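The three approaches in this article agree on printable text but differ on ASCII control characters. The sketch below is a side-by-side comparison. The function names `via_printable`, `via_ord` and `via_encode` are hypothetical and only serve to tell the approaches apart.

```python
import string


def via_printable(text):
    # Keeps only the characters found in string.printable.
    allowed = set(string.printable)
    return ''.join(ch for ch in text if ch in allowed)


def via_ord(text):
    # Keeps every character with a code point below 128.
    return ''.join(ch for ch in text if ord(ch) < 128)


def via_encode(text):
    # Drops every character the ASCII codec cannot encode.
    return text.encode('ascii', errors='ignore').decode()


# \x07 is the ASCII bell control character.
sample = 'a\x07bñ'

print(repr(via_printable(sample)))  # 'ab'
print(repr(via_ord(sample)))        # 'a\x07b'
print(repr(via_encode(sample)))     # 'a\x07b'
```

If control characters should also be removed, the string.printable approach is the stricter choice. The other two keep all 128 ASCII characters.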
