Remove zero width space character from string in Python

Borislav Hadzhiev

Last updated: Jul 10, 2022

Use the str.replace() method to remove zero width characters from a string, e.g. result = my_str.replace('\u200c', ''). The call removes every occurrence of the '\u200c' character (the zero width non-joiner) by replacing it with an empty string. Pass '\u200b' instead to remove the zero width space itself.

main.py
my_str = '\u200chello world'

# ✅ remove zero width non-joiner character with str.replace()
result = my_str.replace('\u200c', '')
print(result)  # 👉️ hello world
print(repr(result))  # 👉️ 'hello world'

# -------------------------------------

# ✅ remove zero width characters with encode() and decode()
result = my_str.encode('ascii', errors='ignore').decode()
print(result)  # 👉️ hello world
print(repr(result))  # 👉️ 'hello world'

The first example uses the str.replace() method to remove the Unicode zero width non-joiner \u200c characters from the string.
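Because these characters are invisible, it can help to confirm which one a string actually contains before removing it. The snippet below is a minimal sketch that uses unicodedata.name() from the standard library to list the code point and official name of each non-ASCII character. The sample string is only an illustration.

```python
import unicodedata

my_str = '\u200chello\u200b world'

# collect (code point, Unicode name) for every non-ASCII character
invisible = [
    (f'U+{ord(char):04X}', unicodedata.name(char, 'UNKNOWN'))
    for char in my_str
    if ord(char) > 127
]

print(invisible)
# 👉️ [('U+200C', 'ZERO WIDTH NON-JOINER'), ('U+200B', 'ZERO WIDTH SPACE')]
```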

The same str.replace() approach can be used to remove Unicode zero width space characters (\u200b).

main.py
my_str = '\u200bhello world'

result = my_str.replace('\u200b', '')
print(result)  # 👉️ hello world
print(repr(result))  # 👉️ 'hello world'
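If a string may contain several kinds of zero width characters, chaining str.replace() calls gets repetitive. One possible alternative, sketched here, is str.translate() with a table that maps each unwanted code point to None. The ZERO_WIDTH_CHARS table and the remove_zero_width() helper are hypothetical names, and the list of characters is an assumption you may want to adjust.

```python
# hypothetical table: code points to delete (mapped to None)
ZERO_WIDTH_CHARS = {
    0x200B: None,  # zero width space
    0x200C: None,  # zero width non-joiner
    0x200D: None,  # zero width joiner
    0xFEFF: None,  # zero width no-break space (byte order mark)
}


def remove_zero_width(text):
    """Return a copy of text without the characters listed above."""
    return text.translate(ZERO_WIDTH_CHARS)


my_str = '\u200bhello\u200c wor\u200dld\ufeff'
result = remove_zero_width(my_str)
print(repr(result))  # 👉️ 'hello world'
```

Other non-ASCII characters, such as accented letters, are left untouched by this approach. Note that the zero width joiner is meaningful in some scripts and emoji sequences, so drop it from the table if you need to preserve those.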

The str.replace method returns a copy of the string with all occurrences of a substring replaced by the provided replacement.

The method takes the following parameters:

| Name  | Description                                              |
| ----- | -------------------------------------------------------- |
| old   | The substring we want to replace in the string           |
| new   | The replacement for each occurrence of old               |
| count | Only the first count occurrences are replaced (optional) |

Note that the method doesn't change the original string. Strings are immutable in Python.
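To illustrate the count parameter and the fact that the original string is left alone, here is a short sketch. The sample string and variable names are made up for the example.

```python
my_str = '\u200bhello\u200bworld\u200b'

# remove only the first occurrence by passing count as the third argument
first_only = my_str.replace('\u200b', '', 1)
print(repr(first_only))  # 👉️ 'hello\u200bworld\u200b'

# without count, every occurrence is removed
all_removed = my_str.replace('\u200b', '')
print(repr(all_removed))  # 👉️ 'helloworld'

# the original string is unchanged
print(len(my_str))  # 👉️ 13
```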

Alternatively, you can use the str.encode() and bytes.decode() methods.

To remove zero width space Unicode characters from a string:

  1. Use the str.encode() method to encode the string using the ASCII encoding.
  2. Set the errors argument to ignore, so all non-ASCII characters are dropped.
  3. Use the bytes.decode() method to convert the bytes object to a string.
main.py
my_str = '\u200chello world'

result = my_str.encode('ascii', errors='ignore').decode()
print(result)  # 👉️ hello world
print(repr(result))  # 👉️ 'hello world'

The str.encode method returns an encoded version of the string as a bytes object. The default encoding is utf-8.

We set the encoding to ascii and the errors argument to ignore.

When the errors keyword argument is set to ignore, characters that cannot be encoded are dropped, so every character outside the ASCII range is removed from the string.

The last step is to use the bytes.decode() method to decode the bytes object to a string.

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.

The bytes.decode method returns a string decoded from the given bytes. The default encoding is utf-8.
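The following sketch walks through the individual steps so you can see what each call returns. It uses the same kind of sample string as above. The variable names are only for illustration.

```python
my_str = '\u200chello'

# str.encode() defaults to UTF-8; the zero width non-joiner takes 3 bytes
encoded = my_str.encode()
print(encoded)  # 👉️ b'\xe2\x80\x8chello'

# bytes.decode() also defaults to UTF-8 and restores the original string
print(encoded.decode() == my_str)  # 👉️ True

# with the default errors='strict', encoding to ASCII raises an error
try:
    my_str.encode('ascii')
except UnicodeEncodeError as err:
    print(type(err).__name__)  # 👉️ UnicodeEncodeError

# errors='ignore' silently drops the character instead
ascii_bytes = my_str.encode('ascii', errors='ignore')
print(ascii_bytes)  # 👉️ b'hello'
```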

When using this approach, be aware that it removes not only zero width characters, but every other non-ASCII character as well.

If your string contains other non-ASCII characters that you want to keep, use the str.replace() method instead.
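A quick comparison makes the difference concrete. The sample string is an assumption chosen to include non-ASCII characters worth keeping.

```python
my_str = '\u200bstraße 10€'

# the ASCII round trip drops 'ß' and '€' along with the zero width space
ascii_only = my_str.encode('ascii', errors='ignore').decode()
print(repr(ascii_only))  # 👉️ 'strae 10'

# str.replace() removes only the character you name
replaced = my_str.replace('\u200b', '')
print(repr(replaced))  # 👉️ 'straße 10€'
```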
