Split a string into list of words without punctuation in Python

Borislav Hadzhiev

Last updated: Aug 31, 2022

Use the re.findall() method to split a string into a list of words without punctuation, e.g. my_list = re.findall(r'[\w]+', my_str). The re.findall() method scans the string for runs of word characters and returns them as a list, so punctuation and whitespace are left out of the result.

main.py
import re

# ✅ split string into list of words without punctuation (re.findall())
my_str = 'One, two. three! four? @five'

my_list = re.findall(r'[\w]+', my_str)
print(my_list)  # 👉️ ['One', 'two', 'three', 'four', 'five']

# -------------------------------------------

# ✅ split string into list of words without punctuation (re.split())
my_list = [item for item in re.split(r'\W+', my_str) if item]
print(my_list)  # 👉️ ['One', 'two', 'three', 'four', 'five']

# -------------------------------------------

# ✅ split string into list of words without punctuation (str.replace())
my_list = my_str.replace(',', '').replace(
    '.', '').replace('!', '').replace('?', '').replace('@', '').split()
print(my_list)  # 👉️ ['One', 'two', 'three', 'four', 'five']

The first example uses the re.findall() method to split a string into a list of words without punctuation.

The re.findall method takes a pattern and a string as arguments and returns a list of strings containing all non-overlapping matches of the pattern in the string.

The first argument we passed to the re.findall() method is a regular expression.

main.py
import re

my_str = 'One, two. three! four? @five'

my_list = re.findall(r'[\w]+', my_str)
print(my_list)  # 👉️ ['One', 'two', 'three', 'four', 'five']

The square brackets [] are used to indicate a set of characters. The set here contains only \w, so the pattern is equivalent to r'\w+'.

The \w character class matches Unicode word characters: letters from most languages, digits, and the underscore.

The plus + causes the regular expression to match 1 or more repetitions of the preceding element (here, the set of word characters).
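To make the Unicode behavior of \w concrete, here is a minimal sketch. The sample text and variable names are made up for illustration. re.ASCII is the standard flag that restricts \w to [a-zA-Z0-9_].

```python
import re

# Hypothetical sample text: 'café, naïve; Привет!'
# (written with escapes so the accented letters are unambiguous)
text = 'caf\u00e9, na\u00efve; \u041f\u0440\u0438\u0432\u0435\u0442!'

# By default, \w matches Unicode word characters in str patterns
unicode_words = re.findall(r'\w+', text)
print(unicode_words)

# With re.ASCII, \w only matches [a-zA-Z0-9_],
# so non-ASCII letters behave like separators
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)
print(ascii_words)
```

The default is usually what you want for text in languages other than English. The re.ASCII variant is shown only to illustrate the difference.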

The re.findall() method returns a list containing the words in the string without any punctuation.
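One thing to watch for is that \w treats the apostrophe as punctuation but keeps digits and underscores. The sketch below uses a made-up sample string to show the effect and one possible workaround (adding ' to the character set). Whether that workaround fits depends on your input.

```python
import re

# Hypothetical sample with a contraction, an underscore and a number
text = "Don't stop_now, it's 2022!"

# \w+ treats the apostrophe as a separator, so contractions are split
plain = re.findall(r'\w+', text)
print(plain)  # ['Don', 't', 'stop_now', 'it', 's', '2022']

# Adding ' to the character set keeps contractions together
with_apostrophes = re.findall(r"[\w']+", text)
print(with_apostrophes)  # ["Don't", 'stop_now', "it's", '2022']
```

Note that the second pattern also keeps apostrophes that are used as quotation marks around a word.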

If you ever need help reading or writing a regular expression, consult the regular expression syntax subheading in the official docs.

The page contains a list of all of the special characters with many useful examples.

If you want explicit control over which punctuation characters are removed, you can use the str.replace() method to remove them before splitting on whitespace characters.

Split a string into list of words without punctuation using str.replace()

To split a string into a list of words without punctuation:

  1. Use the str.replace() method to remove the punctuation from the string.
  2. Use the str.split() method to split on whitespace characters.
  3. The new list will contain the words in the string without punctuation.

main.py
my_str = 'One, two. three! four? @five'

my_list = my_str.replace(',', '').replace(
    '.', '').replace('!', '').replace('?', '').replace('@', '').split()
print(my_list)  # 👉️ ['One', 'two', 'three', 'four', 'five']

We used the str.replace() method to remove all punctuation before splitting on whitespace characters.

The str.replace method returns a copy of the string with all occurrences of a substring replaced by the provided replacement.

The method takes the following parameters:

Name    Description
old     The substring we want to replace in the string
new     The replacement for each occurrence of old
count   Only the first count occurrences are replaced (optional)

The str.replace() method doesn't change the original string. Strings are immutable in Python.
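A short sketch of the count parameter and of the fact that the original string is left untouched. The sample string and variable names are made up for illustration.

```python
original = 'a,b,c,d'

# Without count, every occurrence of old is replaced
no_commas = original.replace(',', '')
print(no_commas)  # abcd

# With count=2 (passed positionally), only the first two commas are removed
first_two = original.replace(',', '', 2)
print(first_two)  # abc,d

# The original string is unchanged because str.replace() returns a new string
print(original)  # a,b,c,d
```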

We used an empty string for the replacement because we want to remove punctuation characters.

You can chain as many calls to the str.replace() method as necessary.
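If the chain of str.replace() calls gets long, one alternative (not used in the article's examples) is str.translate() with a deletion table built from string.punctuation. This is a minimal sketch, assuming you only need to remove ASCII punctuation.

```python
import string

my_str = 'One, two. three! four? @five'

# string.punctuation holds the ASCII punctuation characters.
# str.maketrans('', '', chars) builds a table that deletes every character in chars.
table = str.maketrans('', '', string.punctuation)

my_list = my_str.translate(table).split()
print(my_list)  # ['One', 'two', 'three', 'four', 'five']
```

Keep in mind that string.punctuation also contains the apostrophe and the hyphen, so a word like well-known would become wellknown.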

The str.split() method splits the string into a list of substrings using a delimiter.

When no separator is passed to the str.split() method, it splits the input string on one or more whitespace characters.
main.py
print('a b c d'.split()) # 👉️ ['a', 'b', 'c', 'd']

If the separator is not found in the string, a list containing only 1 element is returned.
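The following sketch shows how str.split() behaves with and without a separator. The strings are made up for illustration.

```python
# No separator: runs of whitespace count as one delimiter,
# and leading/trailing whitespace is ignored
whitespace_split = '  a  b\tc\n'.split()
print(whitespace_split)  # ['a', 'b', 'c']

# Explicit separator: consecutive separators produce empty strings
explicit_split = 'a  b'.split(' ')
print(explicit_split)  # ['a', '', 'b']

# Separator not found: the list contains the whole string as its only element
not_found = 'abc'.split(',')
print(not_found)  # ['abc']

# Empty string with no separator: the result is an empty list
empty_default = ''.split()
print(empty_default)  # []
```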

Split a string into list of words without punctuation using re.split()

Use the re.split() method to split a string into a list of words without punctuation, e.g. re.split(r'\W+', my_str). The re.split() method splits the string on each run of non-word characters, so the punctuation is not kept in the result.

main.py
import re

my_str = 'One, two. three! four? @five'

my_list = [item for item in re.split(r'\W+', my_str) if item]
print(my_list)  # 👉️ ['One', 'two', 'three', 'four', 'five']

The re.split method splits a string on all occurrences of the provided pattern.

The first argument we passed to the method is a regular expression.

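The \W (capital W) special character matches any character that is not a word character, i.e. anything other than a letter, digit, or underscore.

The plus + causes the regular expression to match 1 or more repetitions of the preceding element (here, non-word characters), so a run such as a comma followed by a space is treated as a single separator.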

We end up splitting the string on all occurrences of non-word characters.

We used a list comprehension to remove any empty strings from the result.

You might get empty string values in the list if the string starts with or ends with punctuation.

main.py
import re

my_str = '!One, two. three! four? @five?'

result = re.split(r'\W+', my_str)
print(result)  # 👉️ ['', 'One', 'two', 'three', 'four', 'five', '']

The list comprehension keeps only the truthy items from the list returned by re.split().

main.py
import re

my_str = 'One, two. three! four? @five'

my_list = [item for item in re.split(r'\W+', my_str) if item]
print(my_list)  # 👉️ ['One', 'two', 'three', 'four', 'five']

Empty strings are falsy values, so they don't get added to the new list.

List comprehensions are used to perform some operation for every element or select a subset of elements that meet a condition.
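As an aside, the built-in filter() function with None as its first argument is another way to drop falsy items. This is a small sketch for comparison, not the approach used above. The variable names are made up for illustration.

```python
import re

my_str = '!One, two. three! four? @five?'

# re.split() leaves empty strings at both ends for this input
parts = re.split(r'\W+', my_str)
print(parts)  # ['', 'One', 'two', 'three', 'four', 'five', '']

# filter(None, iterable) keeps only the truthy items
words = list(filter(None, parts))
print(words)  # ['One', 'two', 'three', 'four', 'five']

# The list comprehension gives the same result
same_words = [item for item in parts if item]
print(words == same_words)  # True
```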
