Split a string into words and punctuation in Python

Borislav Hadzhiev

Last updated: Jun 23, 2022

Use the re.findall() method to split a string into words and punctuation, e.g. result = re.findall(r"[\w'\"]+|[,.!?]", my_str). The findall() method returns a list of every match of the pattern: each word and each punctuation mark becomes its own item, and whitespace, which the pattern does not match, is left out.

main.py
import re

my_str = """One, "Two" Three. Four! Five? I'm """

result = re.findall(r"[\w'\"]+|[,.!?]", my_str)

# 👇️ ['One', ',', '"Two"', 'Three', '.', 'Four', '!', 'Five', '?', "I'm"]
print(result)

The re.findall method takes a pattern and a string as arguments and returns a list of strings containing all non-overlapping matches of the pattern in the string.
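As a minimal sketch of that behavior, the snippet below runs re.findall on a made-up string (the `text` value and the `\d+` pattern are illustrative assumptions, not part of the original example). Characters that the pattern does not match are simply skipped.

```python
import re

# Hypothetical sample string, used only for illustration
text = "a1 b22 c333"

# Only the runs of digits match; the letters and spaces are skipped
numbers = re.findall(r"\d+", text)

print(numbers)  # ['1', '22', '333']
```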

The square brackets [] are used to indicate a set of characters.

The \w character class matches most characters that can be part of a word in any language, as well as digits and underscores.

If the re.ASCII flag is set, \w matches only [a-zA-Z0-9_].
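The difference is easiest to see on text with accented letters. The sketch below uses a made-up sample string (an assumption for illustration) and a simplified pattern without quotes, and compares the default behavior with the re.ASCII flag.

```python
import re

# Hypothetical sample text: "Café, naïve!" written with escapes so the
# accented letters are single code points
text = "Caf\u00e9, na\u00efve!"

# Default (Unicode) matching: accented letters count as word characters
unicode_result = re.findall(r"\w+|[,.!?]", text)
print(unicode_result)  # ['Café', ',', 'naïve', '!']

# With re.ASCII, \w matches only [a-zA-Z0-9_], so the accented letters
# match nothing and the words break around them
ascii_result = re.findall(r"\w+|[,.!?]", text, flags=re.ASCII)
print(ascii_result)  # ['Caf', ',', 'na', 've', '!']
```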

Our set of characters also includes a single and double quote.

main.py
import re

my_str = """One, "Two" Three. Four! Five? I'm """

result = re.findall(r"[\w'\"]+|[,.!?]", my_str)

# 👇️ ['One', ',', '"Two"', 'Three', '.', 'Four', '!', 'Five', '?', "I'm"]
print(result)

If you want to exclude single or double quotes from the results, remove the ' and \" characters from between the square brackets.
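One side effect worth knowing: once the single quote is removed from the set, a contraction such as I'm is split in two. The sketch below shows this, along with one possible variation (an assumption, not something the original example does) that keeps only the apostrophe in the set.

```python
import re

my_str = """One, "Two" Three. Four! Five? I'm """

# Both quotes removed from the set: quote characters match nothing,
# so "I'm" is split into 'I' and 'm'
no_quotes = re.findall(r"\w+|[,.!?]", my_str)
print(no_quotes)
# ['One', ',', 'Two', 'Three', '.', 'Four', '!', 'Five', '?', 'I', 'm']

# Only the double quote removed: contractions stay in one piece
keep_apostrophe = re.findall(r"[\w']+|[,.!?]", my_str)
print(keep_apostrophe)
# ['One', ',', 'Two', 'Three', '.', 'Four', '!', 'Five', '?', "I'm"]
```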

The + matches the preceding item (here, the set of characters in square brackets) 1 or more times.

In other words, it doesn't matter how many characters the word consists of: as long as it contains only letters, digits, underscores, and single or double quotes, we consider it to be a single match.

The pipe | character is an OR. Either match A or B.

The second set of square brackets matches a single punctuation mark: a comma, a dot, an exclamation mark or a question mark.

You can add any other punctuation marks between the square brackets, e.g. a colon :, a semicolon ;, brackets or parentheses.
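For instance, a version of the pattern extended with a colon, a semicolon, parentheses and square brackets might look like the sketch below. The sample string is made up for illustration. Note that the [ and ] characters are escaped with a backslash inside the set.

```python
import re

# Hypothetical sample text containing the extra punctuation marks
text = "Note: items (a; b) [c]."

# Added : ; ( ) [ ] to the punctuation set; \[ and \] are escaped
pattern = r"[\w'\"]+|[,.!?:;()\[\]]"

tokens = re.findall(pattern, text)
print(tokens)
# ['Note', ':', 'items', '(', 'a', ';', 'b', ')', '[', 'c', ']', '.']
```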

In its entirety, a match is either one or more letters, digits, underscores or quotes, or a single punctuation mark from the ones we included between the second set of square brackets.

You can tweak the regular expression according to your use case. The "Regular Expression Syntax" section of the official docs for the re module has information regarding what each special character does.
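As one example of such a tweak, if you would rather not list punctuation marks one by one, a negated set can stand in for "any character that is neither a word character nor whitespace". This is a sketch of an alternative pattern, not the one used in the example above, and the sample string is made up. With this pattern, quotes and apostrophes also become separate items.

```python
import re

# Hypothetical sample text
text = "Hello: world; (yes) #1!"

# \w+      -> one or more word characters
# [^\w\s]  -> any single character that is not a word character or whitespace
any_punct = re.findall(r"\w+|[^\w\s]", text)
print(any_punct)
# ['Hello', ':', 'world', ';', '(', 'yes', ')', '#', '1', '!']
```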

Here is the complete code snippet.

main.py
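import re

my_str = """One, "Two" Three. Four! Five? I'm """

# With quotes in the set of characters
result = re.findall(r"[\w'\"]+|[,.!?]", my_str)

# 👇️ ['One', ',', '"Two"', 'Three', '.', 'Four', '!', 'Five', '?', "I'm"]
print(result)

# Without quotes in the set of characters
result = re.findall(r"[\w]+|[,.!?]", my_str)

# 👇️ ['One', ',', 'Two', 'Three', '.', 'Four', '!', 'Five', '?', 'I', 'm']
print(result)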