Remove the HTML tags from a String in Python

Borislav Hadzhiev

Last updated: Jul 10, 2022

Use the re.sub() function to remove the HTML tags from a string, e.g. result = re.sub(r'<.*?>', '', html_string). The call replaces every substring that looks like an HTML tag with an empty string and leaves the text between the tags (including its whitespace) untouched.

main.py
import re

html_string = """
<div>
  <ul>
    <li>Apple</li>
    <li>Banana</li>
    <li>Kiwi</li>
  </ul>
</div>
"""

result = re.sub(r'<.*?>', '', html_string)

# Apple
# Banana
# Kiwi
print(result)

The re.sub() function returns a new string in which every non-overlapping occurrence of the pattern is replaced with the provided replacement.

If the pattern isn't found, the string is returned as is.
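As a small illustration of both behaviors, the sketch below runs the same pattern against a string with no tags and a string with tags. The sample strings and variable names are made up for this example.

```python
import re

pattern = r'<.*?>'

# No tags: the pattern never matches, so the text comes back unchanged.
plain = 'no tags here'
unchanged = re.sub(pattern, '', plain)
print(unchanged == plain)  # True

# Tags present: each match is replaced with an empty string.
cleaned = re.sub(pattern, '', '<b>bold</b>')
print(cleaned)  # bold
```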

The first argument we passed to re.sub() is a regular expression.

The brackets < and > match the opening and closing characters of an HTML tag.

The dot . matches any character except a newline.

The asterisk * matches 0 or more repetitions of the preceding expression (here the dot, i.e. any character).

Adding a question mark ? after the quantifier makes it perform a non-greedy or minimal match, i.e. it matches as few characters as possible.

For example, when matched against the string '<a> b <c>', the greedy regular expression <.*> matches the entire string, whereas the non-greedy <.*?> matches only <a>.
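The difference between the greedy and non-greedy forms can be checked directly. This is a minimal sketch; the sample string '<a> b <c>' and the variable names are chosen only for illustration.

```python
import re

text = '<a> b <c>'

# Greedy: .* grabs as much as possible, up to the last '>'.
greedy = re.match(r'<.*>', text).group()
print(greedy)  # <a> b <c>

# Non-greedy: .*? stops at the first '>'.
lazy = re.match(r'<.*?>', text).group()
print(lazy)  # <a>
```

This is why the non-greedy form is used for stripping tags: the greedy form would also swallow the text between the first and the last tag.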

In its entirety, the regular expression matches all opening and closing HTML tags.
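One consequence of the dot not matching a newline is that a tag split across several lines is not removed by the pattern as written. The sketch below shows this and one possible workaround, passing the re.DOTALL flag so that the dot also matches newlines. The sample markup is an assumption made up for this example.

```python
import re

# A tag whose attributes continue on the next line.
html_string = '<a\n  href="#">link</a>'

# Default behavior: the opening tag spans a newline and is left in place.
default = re.sub(r'<.*?>', '', html_string)
print(repr(default))  # '<a\n  href="#">link'

# With re.DOTALL the dot also matches '\n', so both tags are removed.
dotall = re.sub(r'<.*?>', '', html_string, flags=re.DOTALL)
print(repr(dotall))  # 'link'
```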

Alternatively, you can use the HTMLParser class.

Remove the HTML tags from a String using HTMLParser in Python

To remove the HTML tags from a string in Python:

  1. Subclass the HTMLParser class from the html.parser module.
  2. Implement the handle_data method to receive the text between the HTML tags.
  3. Store the text in a list on the class instance.
  4. Add a get_data() method that joins the list, and call it on an instance after feeding the HTML string to it.
main.py
from html.parser import HTMLParser


class HTMLTagsRemover(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes character references
        # (e.g. &amp;) before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, data):
        # Called with the text found between tags.
        self.fed.append(data)

    # The two handlers below are only called when
    # convert_charrefs is False; they keep references as written.
    def handle_entityref(self, name):
        self.fed.append(f'&{name};')

    def handle_charref(self, name):
        self.fed.append(f'&#{name};')

    def get_data(self):
        return ''.join(self.fed)


def remove_html_tags(value):
    remover = HTMLTagsRemover()
    remover.feed(value)
    remover.close()
    return remover.get_data()


html_string = """
<div>
  <ul>
    <li>Apple</li>
    <li>Banana</li>
    <li>Kiwi</li>
  </ul>
</div>
"""

# Apple
# Banana
# Kiwi
print(remove_html_tags(html_string))

We subclassed the HTMLParser class. The code snippet is very similar to the one Django uses internally for its strip_tags utility.

The HTMLParser class is used to find tags and other markup and call handler functions.

The data between the HTML tags is passed from the parser to the derived class by calling self.handle_data().
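To make the handler mechanism concrete, here is a minimal sketch that records which handler the parser calls for each piece of input. The EventLogger class and its events list are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records each handler call as an (event, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


logger = EventLogger()
logger.feed('<li>Apple</li>')
logger.close()

print(logger.events)
# [('start', 'li'), ('data', 'Apple'), ('end', 'li')]
```

A tag remover only needs to keep the 'data' events, which is why overriding handle_data() is enough.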

When convert_charrefs is set to True, character references automatically get converted to the corresponding Unicode character.

If convert_charrefs is set to False, character references are passed by calling the self.handle_entityref() or self.handle_charref() methods.
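The effect of the two settings can be compared side by side. The sketch below assumes a hypothetical TextCollector class and collect() helper that are not part of the article's code, and a made-up sample string.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text, optionally keeping character references as written."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False.
        self.parts.append(f'&{name};')

    def handle_charref(self, name):
        # Only called when convert_charrefs is False.
        self.parts.append(f'&#{name};')


def collect(html, convert_charrefs):
    parser = TextCollector(convert_charrefs)
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


sample = '<p>Fish &amp; chips &#169;</p>'

decoded = collect(sample, True)
print(decoded)  # Fish & chips ©

raw = collect(sample, False)
print(raw)  # Fish &amp; chips &#169;
```

Keeping the references as written can be preferable when the result is going to be inserted back into an HTML page, since the text stays escaped.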

The get_data() method uses the str.join() method to join the list of strings without a separator.

The str.join method takes an iterable as an argument and returns a string which is the concatenation of the strings in the iterable.
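A short illustration of str.join(), using a made-up list: the string the method is called on is placed between the items, so an empty string joins them with no separator.

```python
parts = ['Apple', 'Banana', 'Kiwi']

# Empty separator: the items are concatenated directly.
no_separator = ''.join(parts)
print(no_separator)  # AppleBananaKiwi

# Any other string is inserted between the items.
with_separator = ', '.join(parts)
print(with_separator)  # Apple, Banana, Kiwi
```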

The remove_html_tags() function takes a string that contains HTML tags and returns a new string where all opening and closing HTML tags have been removed.

main.py
def remove_html_tags(value):
    remover = HTMLTagsRemover()
    remover.feed(value)
    remover.close()
    return remover.get_data()


html_string = """
<div>
  <ul>
    <li>Apple</li>
    <li>Banana</li>
    <li>Kiwi</li>
  </ul>
</div>
"""

# Apple
# Banana
# Kiwi
print(remove_html_tags(html_string))

The function instantiates the class and feeds the string containing the HTML tags to the parser.

The next step is to call the close() method on the instance to handle any buffered data.

Lastly, we call the get_data() method on the instance to join the list of strings into a single string that doesn't contain any HTML tags.
