Borislav Hadzhiev
Thu Jun 23 2022·2 min read
Use the re.findall() method to split a string into words and punctuation, e.g. result = re.findall(r"[\w'\"]+|[,.!?]", my_str). The findall() method returns a list of all matches: each word and each punctuation mark becomes a separate item, and whitespace is dropped because the pattern does not match it.
import re

my_str = """One, "Two" Three. Four! Five? I'm """

result = re.findall(r"[\w'\"]+|[,.!?]", my_str)

# 👇️ ['One', ',', '"Two"', 'Three', '.', 'Four', '!', 'Five', '?', "I'm"]
print(result)
The re.findall method takes a pattern and a string as arguments and returns a list of strings containing all non-overlapping matches of the pattern in the string.
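One consequence is worth a quick sketch: because findall() only returns what the pattern matches, characters that match neither alternative are silently skipped. The sample string and the variable names below are made up for illustration.

```python
import re

# Hypothetical sample string, not taken from the article
text = "Wait: yes - no"

# The colon and the hyphen are in neither character set,
# so findall() skips them just like it skips whitespace
tokens = re.findall(r"[\w'\"]+|[,.!?]", text)

print(tokens)  # ['Wait', 'yes', 'no']
```

If a punctuation mark is missing from your results, check whether it is listed in the second set of square brackets.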
The square brackets [] are used to indicate a set of characters.
The \w character class matches most characters that can be part of a word in any language, as well as numbers and underscores. If the ASCII flag is set, \w matches only [a-zA-Z0-9_].
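A minimal sketch of how the ASCII flag changes what \w matches. The sample string is hypothetical, and the pattern is simplified to \w+ without quotes to keep the comparison easy to read.

```python
import re

# Hypothetical sample string: "Café, naïve!" written with escapes
# so the accented letters are single code points
text = "Caf\u00e9, na\u00efve!"

# Default (Unicode) matching: accented letters count as word characters
unicode_tokens = re.findall(r"\w+|[,.!?]", text)
print(unicode_tokens)  # ['Café', ',', 'naïve', '!']

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the accented
# letters match nothing and the words break apart around them
ascii_tokens = re.findall(r"\w+|[,.!?]", text, flags=re.ASCII)
print(ascii_tokens)  # ['Caf', ',', 'na', 've', '!']
```

For text that may contain non-English words, leaving the flag off is usually the safer choice.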
Our set of characters also includes a single and double quote.
import re

my_str = """One, "Two" Three. Four! Five? I'm """

result = re.findall(r"[\w'\"]+|[,.!?]", my_str)

# 👇️ ['One', ',', '"Two"', 'Three', '.', 'Four', '!', 'Five', '?', "I'm"]
print(result)
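If you want to exclude single or double quotes from the results, remove the ' and \" characters from between the square brackets. Note that contractions such as I'm are then split into I and m, because the apostrophe no longer matches.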
The + matches the preceding item (here, the whole set of characters) 1 or more times.
The pipe | character is an OR. Either match A or B.
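To see what the + contributes, here is a small sketch comparing the pattern with and without it. The sample string and variable names are assumptions made for illustration.

```python
import re

# Hypothetical sample string
text = "Hi, you!"

# Without +, the first set matches one character at a time
single_chars = re.findall(r"[\w'\"]|[,.!?]", text)
print(single_chars)  # ['H', 'i', ',', 'y', 'o', 'u', '!']

# With +, consecutive word characters are grouped into one match,
# while the second alternative still matches one punctuation mark
words = re.findall(r"[\w'\"]+|[,.!?]", text)
print(words)  # ['Hi', ',', 'you', '!']
```

The second set has no +, which is why consecutive punctuation marks such as "?!" come back as separate items.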
The second set of square brackets matches punctuation - a comma, a dot, an exclamation mark and a question mark.
You can add any other punctuation marks between the square brackets, e.g. a colon :, a semicolon ;, brackets or parentheses. A closing square bracket has to be escaped as \] inside the set.
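As a rough sketch of such an extension, the pattern below adds a colon, a semicolon, parentheses and escaped square brackets to the punctuation set. The sample string is hypothetical, and this is only one possible selection of marks.

```python
import re

# Hypothetical sample string
text = "Note: (see [1]); done."

# Parentheses are literal inside a set; the square brackets
# are escaped so they are not read as set delimiters
pattern = r"[\w'\"]+|[,.!?:;()\[\]]"

tokens = re.findall(pattern, text)

print(tokens)
# ['Note', ':', '(', 'see', '[', '1', ']', ')', ';', 'done', '.']
```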
You can tweak the regular expression according to your use case. The "Regular Expression Syntax" section of the official docs for the re module has information regarding what each special character does.
Here is the complete code snippet.
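import re

my_str = """One, "Two" Three. Four! Five? I'm """

# Keep single and double quotes attached to words
result = re.findall(r"[\w'\"]+|[,.!?]", my_str)
# 👇️ ['One', ',', '"Two"', 'Three', '.', 'Four', '!', 'Five', '?', "I'm"]
print(result)

# Exclude single and double quotes from the results
result = re.findall(r"[\w]+|[,.!?]", my_str)
# 👇️ ['One', ',', 'Two', 'Three', '.', 'Four', '!', 'Five', '?', 'I', 'm']
print(result)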