Last updated: Apr 11, 2024
Reading timeยท4 min
lxml
parser for speedhtml5lib
parser to solve the errorThe BeautifulSoup error "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?" occurs when you try to use a parser library without installing it first.
You can solve the error by using the built-in html.parser
parser.
Here is an example of how the error occurs.
from bs4 import BeautifulSoup markup = """<html><head><title>Example html doc</title></head> <body> <p class="title"><b>bobbyhadz.com</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ # โ๏ธ bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? soup = BeautifulSoup(markup, "lxml")
The first argument we passed to the BeautifulSoup
class is the markup string
and the second is the parser.
We used the lxml
parser, however, we haven't installed the module.
One way to solve the error is to use the built-in html.parser
parser.
from bs4 import BeautifulSoup markup = """<html><head><title>Example html doc</title></head> <body> <p class="title"><b>bobbyhadz.com</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ soup = BeautifulSoup(markup, "html.parser") print(soup.title) # ๐๏ธ <title>Example html doc</title> print('-' * 50) print(soup.title.name) # ๐๏ธ title print('-' * 50) print(soup.p) # ๐๏ธ <p class="title"><b>bobbyhadz.com</b></p> print('-' * 50) print(soup.find_all('a'))
We set the parser to html.parser
- BeautifulSoup(markup, "html.parser")
.
The advantages of using the html.parser
parser are:
The disadvantages of the html.parser
parser are:
lxml
parser.html5lib
parser.Make sure you have the beautifulsoup4 module installed to be able to run the code sample.
You can click on the link and follow the instructions.
Depending on your environment, you might have to install:
lxml
parser for speedIf you need to optimize for speed, you can use the lxml
parser.
Open your terminal and run the following command.
pip install lxml pip3 install lxml
Alternatively, you can follow the installation instructions in the following article.
You can use the lxml
parser as follows.
from bs4 import BeautifulSoup markup = """<html><head><title>Example html doc</title></head> <body> <p class="title"><b>bobbyhadz.com</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ soup = BeautifulSoup(markup, "lxml") print(soup.title) print('-' * 50) print(soup.title.name) print('-' * 50) print(soup.p) print('-' * 50) print(soup.find_all('a'))
Notice that we passed the string "lxml"
as the second argument to the
BeautifulSoup
class.
The lxml
parser has the following advantages:
However, the lxml
parser is an external C dependency and has to be installed
separately.
html5lib
parser to solve the errorAlternatively, you can use the html5lib
parser to solve the error.
Open your terminal in your project's root directory and install the html5lib
module.
pip install html5lib # Or with pip3 pip3 install html5lib
Now use the parser as follows.
from bs4 import BeautifulSoup markup = """<html><head><title>Example html doc</title></head> <body> <p class="title"><b>bobbyhadz.com</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ soup = BeautifulSoup(markup, "html5lib") print(soup.title) print('-' * 50) print(soup.title.name) print('-' * 50) print(soup.p) print('-' * 50) print(soup.find_all('a'))
We passed the "html5lib"
string as the second argument to the BeautifulSoup
class.
The html5lib
parser has the following advantages:
The html5lib
has the following disadvantages:
If you forget to install the html5lib
parser, you will get the following
error:
In general, you will want to optimize for speed and thus use the lxml
parser.
If you don't want to install an external dependency, use the built-in
html.parser
.
Note that the lxml
parser is used for parsing HTML.
If you need to parse XML, you have to set the parser to xml
or lxml-xml
.
from bs4 import BeautifulSoup markup = """<html><head><title>Example html doc</title></head> <body> <p class="title"><b>bobbyhadz.com</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ soup = BeautifulSoup(markup, 'xml') print(soup.title) print('-' * 50) print(soup.title.name) print('-' * 50) print(soup.p) print('-' * 50) print(soup.find_all('a'))
The xml
parser has the following advantages:
However, the xml
parser is an external C dependency.
You can learn more about the related topics by checking out the following tutorials: