Remove URLs from Text in Python


Borislav Hadzhiev

Last updated: Jul 11, 2022



Use the re.sub() method to remove URLs from text, e.g. result = re.sub(r'http\S+', '', my_string). The re.sub() method removes the URLs by replacing every substring that matches the pattern with an empty string.

main.py
import re

my_string = """
First https://example.com https://google.com
Second
Third https://example.com
"""

result = re.sub(r'http\S+', '', my_string)

# First
# Second
# Third
print(result)

We used the re.sub() method to remove all URLs from a string.

The re.sub() method returns a new string that is obtained by replacing the occurrences of the pattern with the provided replacement.

main.py
import re

my_str = '1apple, 2apple, 3banana'

result = re.sub(r'[0-9]', '_', my_str)

print(result)  # 👉️ _apple, _apple, _banana

If the pattern isn't found, the string is returned as is.
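To illustrate the no-match case, here is a minimal sketch. The sample string and the variable names are made up for this example.

```python
import re

# Hypothetical sample text that contains no URLs.
plain_text = 'no links here'

# The pattern never matches, so re.sub() has nothing to replace.
unchanged = re.sub(r'http\S+', '', plain_text)

print(unchanged == plain_text)  # True
```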

We used an empty string for the replacement because we want to remove all URLs from the string.

main.py
import re

my_string = """
First https://example.com https://google.com
Second
Third https://example.com
"""

result = re.sub(r'http\S+', '', my_string)

# First
# Second
# Third
print(result)

The first argument we called the re.sub() method with is a regular expression.

The http characters at the start of the regex match those literal characters.

\S matches any character that is not a whitespace character. Notice that the S is uppercase.

The plus + matches the preceding character (any non-whitespace character) 1 or more times.

In its entirety, the regular expression matches substrings starting with http followed by 1 or more non-whitespace characters.
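One way to check what the pattern matches is to list the matches with re.findall() before removing anything. This is a small sketch, and the sample string is an assumption chosen for illustration.

```python
import re

# Hypothetical sample text with two URLs.
sample = 'Docs: https://example.com/a?b=1 and http://test.org end'

# Each match starts at "http" and runs until the next whitespace character.
matches = re.findall(r'http\S+', sample)

print(matches)  # ['https://example.com/a?b=1', 'http://test.org']
```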

If you are worried about also matching non-URL strings of the form http-something, update your regular expression to r'https?://\S+'.

main.py
import re

my_string = """
First https://example.com https://google.com
Second
Third https://example.com
"""

result = re.sub(r'https?://\S+', '', my_string)

# First
# Second
# Third
print(result)

The question mark ? causes the regular expression to match 0 or 1 repetitions of the preceding character.

For example, https? will match either https or http.
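The effect of the question mark can be checked in isolation with re.fullmatch(), which succeeds only when the whole string matches the pattern. The candidate strings below are hypothetical and serve only as a quick sketch.

```python
import re

# Hypothetical candidates to test against the https? pattern.
candidates = ['http', 'https', 'httpss', 'htt']

# The trailing "s" is optional, so only "http" and "https" match in full.
results = {c: bool(re.fullmatch(r'https?', c)) for c in candidates}

print(results)
# {'http': True, 'https': True, 'httpss': False, 'htt': False}
```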

We then have the colon and two forward slashes :// to complete the protocol.

In its entirety, the regular expression matches substrings starting with http:// or https:// followed by 1 or more non-whitespace characters.
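The difference between the two patterns shows up on text that contains a word starting with http that is not a URL. The sketch below uses a made-up sentence, and the names loose and strict are chosen only for this comparison.

```python
import re

# Hypothetical text: "http-server" is a word, not a URL.
text = 'Run http-server, then open http://localhost:8000 now'

# The looser pattern removes anything that starts with "http".
loose = re.sub(r'http\S+', '', text)

# The stricter pattern requires "http://" or "https://".
strict = re.sub(r'https?://\S+', '', text)

print(repr(loose))   # "http-server," is removed along with the URL
print(repr(strict))  # only the URL is removed
```

In both cases the spaces that surrounded a removed URL stay in the result, so the output can contain double spaces.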

If you ever need help reading or writing a regular expression, consult the regular expression syntax subheading in the official docs.

The page contains a list of all of the special characters with many useful examples.
