ParseError: not well-formed (invalid token) [Solved]

Borislav Hadzhiev

Last updated: Apr 11, 2024

# ParseError: not well-formed (invalid token) [Solved]

The error "xml.etree.ElementTree.ParseError: not well-formed (invalid token)" occurs when the string you are trying to parse contains an invalid character.

To solve the error, ignore the errors by using the XMLParser class from the lxml module and setting recover to True.

Here is an example of how the error occurs.

main.py
import xml.etree.ElementTree as ET

data = """
<body>
  <p>bobbyhadz.com \x08</p>
</body>"""

# ⛔️ xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 19
print(ET.XML(data))

The \x08 character (backspace) is a control character that XML 1.0 does not allow, which is what caused the error.
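When the offending character is not obvious, the exception itself can point to it. The following is a minimal sketch that reads the `position` attribute of `ParseError`, a `(line, column)` tuple, and looks up the character at that spot. It assumes the line number is 1-based and the column 0-based, as the error message above suggests, and that the data is plain ASCII so that columns match string indexes.

```python
import xml.etree.ElementTree as ET

data = """
<body>
  <p>bobbyhadz.com \x08</p>
</body>"""

try:
    ET.XML(data)
except ET.ParseError as err:
    # position is a (line, column) tuple taken from the underlying parser
    line, column = err.position

    # Split on "\n" only, because str.splitlines() also splits on
    # some control characters such as \x0c
    bad_char = data.split("\n")[line - 1][column]

    print(f"line {line}, column {column}: {bad_char!r}")
```

Printing the character with `!r` shows its escaped form, which makes invisible control characters easy to spot.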

One way to get around this is to use the XMLParser class from the lxml module.

First, install the lxml module by issuing the following command.

shell
pip install lxml

# or with pip3
pip3 install lxml

Now, import and use the module as follows.

main.py
from lxml import etree

data = """
<body>
  <p>bobbyhadz.com \x08</p>
</body>"""

parser = etree.XMLParser(recover=True)
root = etree.fromstring(data, parser=parser)

print(root.tag)  # 👉️ body

# b'<body>\n  <p>bobbyhadz.com </p>\n</body>'
print(etree.tostring(root))

print(root.find('p').tag)  # 👉️ p

The code for this article is available on GitHub

Make sure to set the recover argument to True, so the parser ignores the errors that were caused by trying to parse invalid characters.

# Removing the invalid characters to solve the error

You can also solve the error by removing the invalid characters.

First, use the built-in repr() function to print the string and make note of the invalid characters.

main.py
import xml.etree.ElementTree as ET

data = """
<body>
  <p>bobbyhadz.com \x08</p>
</body>"""

# '\n<body>\n  <p>bobbyhadz.com \x08</p>\n</body>'
print(repr(data))

The string contains a \x08 that we have to remove.
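For longer strings, scanning the repr() output by eye gets tedious. As a rough sketch, the hypothetical helper `find_control_chars` below lists every control character below U+0020 other than tab, newline and carriage return, together with its index. This covers the most common culprits only, not every character that XML 1.0 forbids.

```python
def find_control_chars(text):
    """Return (index, character) pairs for C0 control characters
    other than tab, newline and carriage return."""
    allowed = "\t\n\r"
    return [
        (index, char)
        for index, char in enumerate(text)
        if ord(char) < 0x20 and char not in allowed
    ]


data = """
<body>
  <p>\x0c bobbyhadz.com \x08</p>
</body>"""

for index, char in find_control_chars(data):
    # repr() shows the escaped form, e.g. '\x08'
    print(index, repr(char))
```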

We can use the str.replace() method to remove the character from the string.

main.py
import xml.etree.ElementTree as ET

data = """
<body>
  <p>bobbyhadz.com \x08</p>
</body>"""

data = data.replace('\x08', '')

root = ET.XML(data)

print(root.tag)

# b'<body>\n  <p>bobbyhadz.com </p>\n</body>'
print(ET.tostring(root))

print(root.find('p').tag)  # 👉️ p

The str.replace() method returns a copy of the string with all occurrences of a substring replaced by the provided replacement.

The method takes the following parameters:

| Name | Description |
| --- | --- |
| old | The substring we want to replace in the string |
| new | The replacement for each occurrence of old |
| count | Only the first count occurrences are replaced (optional) |
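To illustrate the optional count parameter, here is a small example on a made-up string. The count is passed positionally, since passing it as a keyword argument is not supported in older Python versions.

```python
text = "a\x08b\x08c"

# Without count, every occurrence is removed
print(repr(text.replace("\x08", "")))  # 'abc'

# With count set to 1, only the first occurrence is removed
print(repr(text.replace("\x08", "", 1)))  # 'ab\x08c'
```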

We remove all occurrences of the character by replacing them with empty strings.

You might have to call the replace() method multiple times if your string contains multiple invalid characters.

main.py
import xml.etree.ElementTree as ET

data = """
<body>
  <p>\x0c bobbyhadz.com \x08</p>
</body>"""

data = data.replace('\x08', '').replace('\x0c', '')

root = ET.XML(data)

print(root.tag)

# b'<body>\n  <p> bobbyhadz.com </p>\n</body>'
print(ET.tostring(root))

print(root.find('p').tag)  # 👉️ p

The first call to the str.replace() method removes the \x08 character and the second removes the \x0c character.

The str.replace() method returns a new string with the matches replaced, so you can chain as many calls to the method as necessary.
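If you don't know in advance which invalid characters the string contains, chaining calls becomes impractical. One possible alternative, sketched below, is to strip every character outside the ranges that XML 1.0 allows with a single regular expression. The names `INVALID_XML_CHARS` and `strip_invalid_xml_chars` are illustrative, and the pattern is written from the character ranges in the XML 1.0 specification.

```python
import re
import xml.etree.ElementTree as ET

# Matches any character that is NOT tab, newline, carriage return,
# or in the ranges that XML 1.0 permits
INVALID_XML_CHARS = re.compile(
    "[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def strip_invalid_xml_chars(text):
    """Return a copy of text without characters that XML 1.0 forbids."""
    return INVALID_XML_CHARS.sub("", text)


data = """
<body>
  <p>\x0c bobbyhadz.com \x08</p>
</body>"""

root = ET.XML(strip_invalid_xml_chars(data))

print(root.tag)
print(repr(root.find("p").text))
```

Keep in mind that this silently discards data, so it is best suited to cases where the stray characters carry no meaning.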

Once you remove all invalid characters, you can safely use the ET.XML() function.

The function parses an XML document from a string constant and returns the root Element.


Copyright © 2024 Borislav Hadzhiev