How to Remove \ufeff from a String in Python

Borislav Hadzhiev

Last updated: Apr 9, 2024

# Remove \ufeff from a string in Python

Use the str.replace() method to remove the \ufeff BOM character from a string.

The replace() method will remove the \ufeff character from the string by replacing it with an empty string.

main.py
my_str = '\ufefffirst line'

result = my_str.replace('\ufeff', '')

print(repr(result))  # 👉️ 'first line'

The code for this article is available on GitHub

The \ufeff character is a byte order mark (BOM) and is interpreted as a zero-width non-breaking space.

The BOM character ends up in a string when bytes that begin with a BOM are decoded with a codec that doesn't strip it, for example when bytes encoded with utf-8-sig are decoded as plain utf-8.
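A minimal sketch of how the character ends up in a string, assuming the bytes were written with utf-8-sig and then decoded with plain utf-8 (the sample text and variable names are made up for illustration):

```python
# Bytes produced by a writer that prepends a UTF-8 BOM.
raw = 'first line'.encode('utf-8-sig')

# Plain utf-8 decodes the BOM bytes as the character U+FEFF and keeps it.
wrong = raw.decode('utf-8')

# utf-8-sig recognizes the leading BOM and drops it.
right = raw.decode('utf-8-sig')

print(repr(wrong))  # '\ufefffirst line'
print(repr(right))  # 'first line'
```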

If you have a string that contains a BOM character, use the str.replace() method to remove it.

The str.replace method returns a copy of the string with all occurrences of a substring replaced by the provided replacement.

The method takes the following parameters:

| Name | Description |
| --- | --- |
| old | The substring we want to replace in the string |
| new | The replacement for each occurrence of old |
| count | Only the first count occurrences are replaced (optional) |

The method doesn't change the original string. Strings are immutable in Python.
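The effect of the optional count argument and of string immutability can be sketched as follows. The sample string is invented for illustration, and count is passed positionally so the snippet also runs on Python versions that don't accept it as a keyword:

```python
original = '\ufeffa\ufeffb'

# Without count, every occurrence is replaced.
all_removed = original.replace('\ufeff', '')

# With count=1, only the first occurrence is replaced.
first_removed = original.replace('\ufeff', '', 1)

print(repr(all_removed))    # 'ab'
print(repr(first_removed))  # 'a\ufeffb'
print(repr(original))       # '\ufeffa\ufeffb' (the original is unchanged)
```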

# Set the encoding to utf-8-sig when opening a file

If you got the error "UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff'" while working with text read from a file, the BOM was most likely kept as the first character of the text. Explicitly set the encoding keyword argument to utf-8-sig when opening the file.

main.py
with open('example.txt', 'r', encoding='utf-8-sig') as f:
    lines = f.readlines()

    print(lines)

The open() function takes an encoding keyword argument, which can be set to utf-8-sig to treat a leading byte order mark as metadata instead of as part of the text.

When decoding, the utf-8-sig codec skips the BOM (the three bytes EF BB BF) if it appears at the start of the file.

When using the utf-8 encoding, the use of the byte order mark (BOM) is discouraged and should be avoided.
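To see the difference between the two codecs on a real file, here is a small sketch. It assumes a throwaway temporary file rather than the example.txt used above, and the variable names are illustrative only:

```python
import os
import tempfile

# Create a throwaway file that starts with a UTF-8 BOM.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write('first line\n')

    # Plain utf-8 keeps the BOM as the first character of the text.
    with open(path, 'r', encoding='utf-8') as f:
        with_bom = f.readlines()

    # utf-8-sig drops the BOM while decoding.
    with open(path, 'r', encoding='utf-8-sig') as f:
        without_bom = f.readlines()
finally:
    os.remove(path)

print(with_bom)     # ['\ufefffirst line\n']
print(without_bom)  # ['first line\n']
```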

# The \ufeff character should automatically get removed

The \ufeff Unicode character should automatically get removed if you decode the bytes using the correct encoding.

For example, the utf-8-sig encoding writes a BOM at the start of the encoded bytes.

main.py
my_bytes = 'bobbyhadz.com'.encode('utf-8-sig')

print(my_bytes)  # 👉️ b'\xef\xbb\xbfbobbyhadz.com'

my_str = my_bytes.decode('utf-8-sig')

print(my_str)  # 👉️ bobbyhadz.com

If you decode the bytes with the same encoding, the \ufeff character gets automatically removed.

The utf-16 encoding also encodes with BOM.

main.py
my_bytes = 'bobbyhadz.com'.encode('utf-16')

# 👇️ b'\xff\xfeb\x00o\x00b\x00b\x00y\x00h\x00a\x00d\x00z\x00.\x00c\x00o\x00m\x00'
print(my_bytes)

my_str = my_bytes.decode('utf-16')

print(my_str)  # 👉️ bobbyhadz.com
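The byte output above is what a little-endian machine produces, because the generic utf-16 codec writes the BOM in the platform's native byte order. As a rough sketch using the constants from the standard codecs module (the sample text is illustrative), the explicit utf-16-le codec neither writes nor strips a BOM. That is one way \ufeff can survive decoding:

```python
import codecs

data = 'abc'.encode('utf-16')

# The generic utf-16 codec writes a BOM in the platform's native byte order.
has_bom = data.startswith(codecs.BOM_UTF16)

# The explicit-endianness codecs do not write a BOM ...
no_bom = 'abc'.encode('utf-16-le')

# ... and do not strip one when decoding, so U+FEFF stays in the string.
kept = (codecs.BOM_UTF16_LE + no_bom).decode('utf-16-le')

# The generic codec reads the BOM, picks the byte order, and removes it.
stripped = (codecs.BOM_UTF16_LE + no_bom).decode('utf-16')

print(has_bom)         # True
print(no_bom)          # b'a\x00b\x00c\x00'
print(repr(kept))      # '\ufeffabc'
print(repr(stripped))  # 'abc'
```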

On the other hand, the utf-8 encoding encodes without BOM.

main.py
my_bytes = 'bobbyhadz.com'.encode('utf-8')

print(my_bytes)  # 👉️ b'bobbyhadz.com'

my_str = my_bytes.decode('utf-8')

print(my_str)  # 👉️ bobbyhadz.com

The Byte Order Mark should only appear at the start of a document and should automatically get removed when the correct encoding is used.

If you don't know the correct encoding, use the str.replace() method to remove the character from the string.

main.py
my_str = '\ufeffbobbyhadz.com'

result = my_str.replace('\ufeff', '')

print(repr(result))  # 👉️ 'bobbyhadz.com'
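Because the BOM should only appear at the start of a document, you may prefer to strip just a leading \ufeff and leave any later occurrences alone. The helper below is a hypothetical sketch of that idea and is not part of the standard library. The name strip_leading_bom is made up for this example:

```python
BOM = '\ufeff'


def strip_leading_bom(text):
    """Return text without a single leading U+FEFF, if present."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text


print(repr(strip_leading_bom('\ufeffbobbyhadz.com')))  # 'bobbyhadz.com'
print(repr(strip_leading_bom('bobby\ufeffhadz.com')))  # 'bobby\ufeffhadz.com'
```

Unlike str.replace(), this leaves a \ufeff in the middle of the string untouched, which may or may not be what you want.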

Copyright © 2024 Borislav Hadzhiev