Last updated: Apr 9, 2024
Use the unidecode package to remove the accents from a string.
The unidecode() function removes the accents from the string by replacing the characters with characters that can safely be encoded to ASCII.
The first thing you should do is install the unidecode package.
```shell
pip install Unidecode

# 👇️ or with pip3
pip3 install Unidecode
```
Now you can import and use the unidecode function.
```python
from unidecode import unidecode

str_with_accents = 'ÂéüÒÑ'

str_without_accents = unidecode(str_with_accents)

print(str_without_accents)  # 👉️ 'AeuON'
```
The unidecode function takes a string that possibly contains non-ASCII characters and returns a string that can safely be encoded to ASCII.
If your string contains characters that unidecode cannot translate to ASCII-compatible characters, the function replaces them with empty strings by default.
```python
from unidecode import unidecode

str_with_accents = 'ÂéüÒÑ\ue123'

str_without_accents = unidecode(str_with_accents)

print(str_without_accents)  # 👉️ 'AeuON'
```
Notice that the \ue123 character couldn't be converted to an ASCII-compatible character and was dropped from the string.
If you need to remove the accents from a list of strings, use a list comprehension.
```python
from unidecode import unidecode

names = ['Renée', 'Noël', 'Sørina', 'Adrián', 'Zoë']

names_without_accents = [
    unidecode(name) for name in names
]

# 👇️ ['Renee', 'Noel', 'Sorina', 'Adrian', 'Zoe']
print(names_without_accents)
```
List comprehensions are used to perform some operation for every element or select a subset of elements that meet a condition.
On each iteration, we use the unidecode() function to remove the accents from the current list item and return the result.
The strings in the new list don't contain any accents.
If you want to raise an error when the unidecode function encounters a character it cannot translate to an ASCII-compatible character, set the errors keyword argument to 'strict'.
```python
from unidecode import unidecode

str_with_accents = 'ÂéüÒÑ\ue123'

# ⛔️ unidecode.UnidecodeError: no replacement found for character '\ue123' in position 5
str_without_accents = unidecode(str_with_accents, errors='strict')
```
The unidecode function found no replacement for the \ue123 character, so it raised an error.
The unidecode package exposes a UnidecodeError exception whose index attribute gives us the position of the character that couldn't be translated.
```python
from unidecode import unidecode, UnidecodeError

str_with_accents = 'ÂéüÒÑ\ue123'

try:
    # errors='strict' raises UnidecodeError for the \ue123 character
    str_without_accents = unidecode(str_with_accents, errors='strict')
except UnidecodeError as e:
    print(e.index)  # 👉️ 5
```
The character at index 5 caused the error.
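As a rough sketch of how the index could be put to use, the helper below returns either the converted string or the position and value of the first character that could not be translated. The name to_ascii_or_report is hypothetical and not part of the unidecode package; only unidecode(), its errors argument, and UnidecodeError.index come from the library.

```python
from unidecode import unidecode, UnidecodeError


def to_ascii_or_report(text):
    """Hypothetical helper: return (ascii_text, None) on success,
    or (None, (index, char)) for the first untranslatable character."""
    try:
        return unidecode(text, errors='strict'), None
    except UnidecodeError as e:
        # e.index is the position of the offending character in text
        return None, (e.index, text[e.index])


print(to_ascii_or_report('ÂéüÒÑ'))
print(to_ascii_or_report('ÂéüÒÑ\ue123'))
```

Returning the character alongside the index makes it easier to log or inspect the problem without re-indexing the original string.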
You can also set the errors keyword argument to 'replace' to substitute another string for each character that cannot be translated.
```python
from unidecode import unidecode

str_with_accents = 'ÂéüÒÑ\ue123'

str_without_accents = unidecode(
    str_with_accents,
    errors='replace',
    replace_str='?'
)

print(str_without_accents)  # 👉️ 'AeuON?'
```
The replace_str keyword argument is used to specify the replacement string.
You can set the errors keyword argument to 'preserve' if you want to keep the characters that cannot be translated to ASCII-compatible characters.
```python
from unidecode import unidecode

str_with_accents = 'ÂéüÒÑ\ue123'

str_without_accents = unidecode(
    str_with_accents,
    errors='preserve',
)

# 👇️ 'AeuON' followed by the untranslated \ue123 character
print(str_without_accents)
```
However, if errors is set to 'preserve', the unidecode function doesn't necessarily produce an ASCII-compatible string.
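One way to check this, sketched here with the built-in str.isascii() method (Python 3.7+), is to compare the default behavior with errors='preserve' on the same input. The variable names are illustrative only.

```python
from unidecode import unidecode

text = 'ÂéüÒÑ\ue123'

# Default behavior: the untranslatable character is dropped
dropped = unidecode(text)

# errors='preserve': the untranslatable character is kept as-is
preserved = unidecode(text, errors='preserve')

print(dropped.isascii())    # 👉️ True
print(preserved.isascii())  # 👉️ False
```

If the result has to be encoded as ASCII later, the default setting or errors='replace' is the safer choice.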
unicodedata
You can also use the built-in unicodedata module to remove the accents from a string.
```python
import unicodedata


def remove_accents(string):
    return ''.join(
        char for char in unicodedata.normalize('NFD', string)
        if unicodedata.category(char) != 'Mn'
    )


str_with_accents = 'ÂéüÒÑ'

print(remove_accents(str_with_accents))  # 👉️ AeuON

# 👇️ Noel, Adrian, Sørina, Zoe, Renee
print(remove_accents('Noël, Adrián, Sørina, Zoë, Renée'))
```
The unicodedata module is a built-in Python module, so you don't have to install anything.
The code sample uses a generator expression to iterate over the characters of the string.
The unicodedata.normalize() function returns the normal form for the given string.
The first argument is the form, NFD in our case. The normal form NFD translates each character into its decomposed form, so an accented letter becomes a base letter followed by one or more combining marks.
```python
import unicodedata

str_with_accents = 'ÂéüÒÑ'

result = list(
    char for char in unicodedata.normalize('NFD', str_with_accents)
    if unicodedata.category(char) != 'Mn'
)

print(result)  # 👉️ ['A', 'e', 'u', 'O', 'N']
```
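The decomposition step is easier to see on a single character. This is a minimal sketch with illustrative variable names: the precomposed é (U+00E9) is one code point, and after NFD normalization it becomes two.

```python
import unicodedata

composed = '\u00e9'  # 'é' as a single precomposed code point

decomposed = unicodedata.normalize('NFD', composed)

print(len(composed))    # 👉️ 1
print(len(decomposed))  # 👉️ 2

# 👇️ ['U+0065', 'U+0301'] (the letter e, then a combining acute accent)
print([f'U+{ord(char):04X}' for char in decomposed])
```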
The unicodedata.category() function takes a character as a parameter and returns the general category assigned to the character.
```python
import unicodedata

str_with_accents = 'aeÂéüÒÑ'

print(unicodedata.category(str_with_accents[0]))  # Ll
print(unicodedata.category(str_with_accents[1]))  # Ll
print(unicodedata.category(str_with_accents[2]))  # Lu
print(unicodedata.category(str_with_accents[3]))  # Ll
```
The Mn category denotes non-spacing combining marks, such as the accents produced by NFD decomposition, which is why the function filters them out.
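To see which characters fall into the Mn category, the sketch below prints the category of each code point after decomposition. It also suggests why Sørina keeps its ø in the earlier output: ø has no decomposition, so it stays a single letter with category Ll and there is no combining mark to filter out.

```python
import unicodedata

sample = '\u00e9\u00f8'  # 'é' followed by 'ø'

for char in unicodedata.normalize('NFD', sample):
    print(f'U+{ord(char):04X}', unicodedata.category(char))

# 👇️ Output:
# U+0065 Ll
# U+0301 Mn
# U+00F8 Ll
```

If characters like ø also need an ASCII equivalent, the unidecode package is the better fit, since it maps 'Sørina' to 'Sorina'.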