Introduction to Regular Expression (RegEx) with the Python re Package

Regular expressions (RegEx) are a compact pattern language, based on an algebraic notation, for searching for specific text within text strings. They can be used to extract matching text from a single document, and they are especially useful in natural language processing (NLP) applications where a corpus (a collection of texts) needs to be searched.

This post provides an introduction to RegEx with the support of the Python re package. Click here for the Python code. All of the examples assume the package has been loaded with import re. We will be using the poem Twinkle, Twinkle Little Star by Jane Taylor as our sample text string.

text = 'Twinkle, twinkle, little star, How I wonder what you are! Up above the world so high, Like a diamond in the sky. When the blazing sun is gone, When he nothing shines upon, Then you show your little light, Twinkle, twinkle, all the night. Then the traveler in the dark Thanks you for your tiny spark, How could he see where to go, If you did not twinkle so? In the dark blue sky you keep, Often through my curtains peep For you never shut your eye, Till the sun is in the sky. As your bright and tiny spark Lights the traveler in the dark, Though I know not what you are, Twinkle, twinkle, little star.'

re Functions You Need to Know:

Here are some handy functions to know to search for a string match:

re.search – returns a match object for the first match in the text string, or None if there is no match
re.findall – returns a list of all matches in the text string (an empty list if there are none)
re.split – returns a list where the text string is split at each match
re.sub – substitutes each match in the text string with a new string
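
As a quick orientation, here is a minimal sketch of the four functions side by side. It uses only the first two lines of the poem (the variable name line is just an illustrative choice) so that it can be run on its own:

```python
import re

# Short excerpt from the poem so the example is self-contained.
line = "Twinkle, twinkle, little star, How I wonder what you are!"

first = re.search(r"twinkle", line)          # match object for the first hit, or None
all_hits = re.findall(r"[Tt]winkle", line)   # list of every non-overlapping hit
pieces = re.split(r", ", line)               # list of the text between the matches
swapped = re.sub(r"star", "sun", line)       # new string with each match replaced

print(first.group())   # twinkle
print(all_hits)        # ['Twinkle', 'twinkle']
print(pieces)          # ['Twinkle', 'twinkle', 'little star', 'How I wonder what you are!']
print(swapped)         # Twinkle, twinkle, little sun, How I wonder what you are!
```

Each of these functions is covered in more detail in the rest of the post.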

Finding a Word:

The simplest RegEx is to match on a word. Let’s return a list of all instances of “twinkle” using the re.findall function:

re.findall(r"twinkle", text)
output: ['twinkle', 'twinkle', 'twinkle', 'twinkle']

Note: the ‘r’ prefix marks the pattern as a raw string, meaning that Python treats the backslash (\) as a literal character rather than as the start of an escape sequence such as a newline (\n).
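
To see why the raw string prefix matters, here is a small sketch comparing patterns written with and without it. The sample string "little star" is taken from the poem; the variable names are only for illustration:

```python
import re

# "\n" is a single newline character; r"\n" is a backslash followed by "n".
plain_len = len("\n")
raw_len = len(r"\n")
print(plain_len, raw_len)   # 1 2

# Without the r prefix, Python turns "\b" into a backspace character,
# so the pattern looks for literal backspaces and finds nothing.
without_raw = re.findall("\bstar\b", "little star")
# With the r prefix, \b reaches the regex engine as a word boundary.
with_raw = re.findall(r"\bstar\b", "little star")

print(without_raw)   # []
print(with_raw)      # ['star']
```

Writing every pattern as a raw string is a simple habit that avoids this kind of silent mismatch.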

You will notice this only returns the instances of “twinkle” where “t” is lowercase. This is because RegEx is case-sensitive, so if you want both “Twinkle” and “twinkle” you have to list the upper and lowercase letters within square brackets, which are used to indicate a set of characters. In the case below, the square brackets indicate the set containing upper and lowercase “t”.

re.findall(r"[Tt]winkle", text)
output: ['Twinkle', 'twinkle', 'Twinkle', 'twinkle', 'twinkle', 'Twinkle', 'twinkle']

Now we have all seven instances of “twinkle” from the poem being returned.
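
As an alternative to listing both cases in square brackets, the re package also accepts the re.IGNORECASE flag. The sketch below uses a short excerpt of the poem; note that, unlike [Tt]winkle, the flag ignores case for every letter in the pattern, so it would also match a string such as “TWINKLE”:

```python
import re

# Short excerpt from the poem so the example is self-contained.
line = "Twinkle, twinkle, little star"

# The set [Tt] only relaxes the case of the first letter.
with_set = re.findall(r"[Tt]winkle", line)
# re.IGNORECASE relaxes the case of every letter in the pattern.
with_flag = re.findall(r"twinkle", line, flags=re.IGNORECASE)

print(with_set)    # ['Twinkle', 'twinkle']
print(with_flag)   # ['Twinkle', 'twinkle']
```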

Searching for Sets of Characters

The square brackets are useful for searching for any one of a set of characters or digits, wherever they occur in a text string. For example, if we want to find every individual “b” or “c” we would use [bc]:

re.findall(r"[bc]", text)
output: ['b', 'b', 'c', 'b', 'c', 'b']

What if we want to find all letters of the alphabet or all single digits? Or all uppercase or all lowercase letters? We wouldn’t want to type the entire alphabet or digits 0-9 in brackets now would we? Luckily there is a shorthand where we can use [A-Z] for all uppercase letters, [a-z] for all lowercase letters and [0-9] for all digits. If we want all letters we can use [a-zA-Z]. Let’s extract all uppercase letters from the poem:

re.findall(r"[A-Z]", text)
output: ['T','H','I','U','L','W','W','T','T','T','T','H','I','I','O','F','T','A','L','T','I', 'T']
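
The poem contains no digits, so here is a small sketch of the other ranges on a made-up sample string (the string and variable names are invented purely for illustration):

```python
import re

# Invented sample text containing letters and digits.
sample = "Star 7 is brighter than star 42"

digits = re.findall(r"[0-9]", sample)      # every single digit
uppercase = re.findall(r"[A-Z]", sample)   # every uppercase letter
letters = re.findall(r"[a-zA-Z]", "Up 2")  # every letter, upper or lowercase

print(digits)      # ['7', '4', '2']
print(uppercase)   # ['S']
print(letters)     # ['U', 'p']
```

Note that each match is a single character, so “42” comes back as two separate digits.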

Special Characters

Within RegEx, some characters have special meaning depending on how they are used.

The caret (^) symbol has two special meanings. When it is the first character after an opening square bracket, it negates the set, so the set matches any character except the specified characters. When the caret is at the start of the pattern and not within square brackets, it matches the start of the string. If the caret is used anywhere else inside square brackets, then it’s just a caret symbol. What would happen if we used [^Tt]?

re.findall(r"[^Tt]", text)
output: ['w','i','n','k','l','e',',',' ','w','i','n','k','l','e',',',' ','l','i','l','e',' ','s','a','r', ...]

[^Tt] returns all characters except for all instances of upper and lowercase ‘t’.

And if we wanted to find what letter was at the start of the string we could use ^[A-Za-z]:

re.findall(r"^[A-Za-z]", text)
output: ['T']
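
The three roles of the caret can be compared in one short sketch. The first two patterns run on a line from the poem, while the string "a^b" is an invented sample used only to show the literal caret:

```python
import re

# One line from the poem so the example is self-contained.
line = "Up above the world so high"

# First character inside the brackets: negates the set.
not_lower_or_space = re.findall(r"[^a-z ]", line)
# Start of the pattern, outside brackets: anchors to the start of the string.
first_letter = re.findall(r"^[A-Z]", line)
# Inside the brackets but not first: an ordinary caret character.
literal_caret = re.findall(r"[a^]", "a^b")

print(not_lower_or_space)   # ['U']
print(first_letter)         # ['U']
print(literal_caret)        # ['a', '^']
```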

Similar to how the caret can tell us what a string starts with, the dollar sign ($) tells us what a string ends with. Both the caret and dollar sign are what are referred to as anchors – special characters that anchor RegEx to a specific part of the text string. Let’s check and see if the poem ends with “star.”:

re.findall(r"star\.$", text)
output: ['star.']

You might notice in the previous line of code that the period is preceded by a backslash. This is because the period (.) is also a special character and the backslash (believe it or not – another special character) cancels the special meaning so that it is treated as a normal period. When used as a special character, each period matches any single character (except a newline), so the number of periods in the RegEx sets how many arbitrary characters are matched.

For example, we could run [Tt]...kle and return a list of all instances of “twinkle” from the text string. This works because we replace three characters of the actual word with periods: we are telling RegEx to return any run of characters that starts with “T” or “t”, ends in “kle” and has any three characters in the middle:

re.findall(r"[Tt]...kle", text)
output: ['Twinkle', 'twinkle', 'Twinkle', 'twinkle', 'twinkle', 'Twinkle', 'twinkle']

If we broaden the requirements of the RegEx to [Tt]....le (replacing “k” with a period), other words now appear in our output, because it returns every run of characters that matches the RegEx rules:

re.findall(r"[Tt]....le", text)
output: ['Twinkle','twinkle','Twinkle','twinkle','travele','twinkle','travele','Twinkle','twinkle']
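
The difference between a special period and an escaped period is easy to see side by side. This sketch uses a shortened excerpt that joins two lines of the poem ending in “sky”:

```python
import re

# Shortened excerpt: two lines of the poem joined for illustration.
excerpt = "In the dark blue sky you keep, Till the sun is in the sky."

# Unescaped period: "sky" followed by ANY single character.
any_char = re.findall(r"sky.", excerpt)
# Escaped period: "sky" followed by a literal full stop only.
literal_dot = re.findall(r"sky\.", excerpt)

print(any_char)      # ['sky ', 'sky.']
print(literal_dot)   # ['sky.']
```

The unescaped version also picks up “sky ” (with a trailing space), because a space counts as “any character”.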

Another special character that provides us with optionality is the question mark (?). The question mark means one or zero occurrences of the previous character. This is useful for words that might be plural, with an “s” at the end in some instances, but singular in others. From the poem, we can extract the word “Lights” as such:

re.findall(r"Lights?", text)
output: ['Lights']

Now what if we had the text: “Lights and light”?

re.findall(r"[Ll]ights?", "Lights and light")
output: ['Lights', 'light']

Here, we are able to match both the singular and plural forms of the word by using the question mark for optionality.

These were just some examples of special characters. Here are some other special characters for you to try out:

* – zero or more occurrences
+ – one or more occurrences
{} – the exact specified number of occurrences
| – either or
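
Here is a minimal sketch of these four special characters in action. The misspellings “lile” and “litle” are invented so that the number of “t” characters varies between words:

```python
import re

# Invented sample: the same word with zero, one and two "t" characters.
words = "lile litle little"

zero_or_more = re.findall(r"lit*le", words)    # "t" repeated 0 or more times
one_or_more = re.findall(r"lit+le", words)     # "t" repeated 1 or more times
exactly_two = re.findall(r"lit{2}le", words)   # "t" repeated exactly 2 times
either_or = re.findall(r"sky|star", "little star in the sky")  # one word or the other

print(zero_or_more)   # ['lile', 'litle', 'little']
print(one_or_more)    # ['litle', 'little']
print(exactly_two)    # ['little']
print(either_or)      # ['star', 'sky']
```

Like the question mark, the characters *, + and {} apply to whatever comes immediately before them, here the letter “t”.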

Special Sequences

In addition to special characters, there are also special sequences indicated by a “\” followed by one of the characters below:

\b – matches where the specified characters are at the beginning or at the end of a word
\B – matches where the specified characters are present, but NOT at the beginning or at the end of a word
\d – matches a digit (numbers from 0-9)
\D – matches any character that is NOT a digit
\s – matches a white space character
\S – matches any character that is NOT a white space character
\w – matches a word character (a letter, a digit or an underscore)
\W – matches any character that is NOT a word character

Let’s try to match all instances of ‘winkle’ where it is at the end of a word:

re.findall(r"winkle\b", text)
output: ['winkle', 'winkle', 'winkle', 'winkle', 'winkle', 'winkle', 'winkle']

We know the word “twinkle” appears seven times in the poem, so we would expect to see seven occurrences of “winkle” in the output. We can also return this same output by placing “\B” before the pattern, since \B only matches where the characters are not at the beginning of a word, and “winkle” always follows a “t”:

re.findall(r"\Bwinkle", text)
output: ['winkle', 'winkle', 'winkle', 'winkle', 'winkle', 'winkle', 'winkle']
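
Because \b and \B are easy to mix up, here is a small sketch that places each one on either side of “winkle”, using the first line of the poem. In this text “winkle” always follows a “t”, so it never sits at the beginning of a word:

```python
import re

# First line of the poem so the example is self-contained.
line = "Twinkle, twinkle, little star"

at_word_start = re.findall(r"\bwinkle", line)       # "winkle" at the start of a word
not_at_word_start = re.findall(r"\Bwinkle", line)   # "winkle" NOT at the start of a word
at_word_end = re.findall(r"winkle\b", line)         # "winkle" at the end of a word
whole_word = re.findall(r"\btwinkle\b", line)       # "twinkle" as a complete word

print(at_word_start)       # []
print(not_at_word_start)   # ['winkle', 'winkle']
print(at_word_end)         # ['winkle', 'winkle']
print(whole_word)          # ['twinkle']
```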

Let’s now match on all characters, not including white space characters with “\S”:

re.findall(r"\S", text)
output: ['T','w','i','n','k','l','e',',','t','w','i','n','k','l','e',',','l','i','t','t','l','e','s','t','a','r', ...]

And to compare to the opposite, let’s use “\s” to return all white space characters from the poem:

re.findall(r"\s", text)
output: [' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',...]
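
The digit and word-character sequences from the list above work the same way. Since the poem has no digits, this sketch runs on an invented sample string and combines some of the sequences with the + character to match whole runs:

```python
import re

# Invented sample text containing words, digits and punctuation.
sample = "Star 42 shines at 9pm!"

single_digits = re.findall(r"\d", sample)    # each digit on its own
digit_runs = re.findall(r"\d+", sample)      # runs of one or more digits
word_runs = re.findall(r"\w+", sample)       # runs of letters, digits or underscores
non_word = re.findall(r"\W", sample)         # everything that is not a word character

print(single_digits)   # ['4', '2', '9']
print(digit_runs)      # ['42', '9']
print(word_runs)       # ['Star', '42', 'shines', 'at', '9pm']
print(non_word)        # [' ', ' ', ' ', ' ', '!']
```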

Other Functions

Lastly, since this post was heavily focused on the re.findall function, let’s try some examples with the other re functions introduced at the start.

re.search scans the text and returns a match object for the first occurrence of the pattern, or None if the pattern does not occur in the text:

re.search(r"twinkle", text)
output: <re.Match object; span=(9, 16), match='twinkle'>
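
The match object carries the matched text and its position. Here is a minimal sketch of reading those values and of handling the case where nothing matches; it uses the first line of the poem, and the search for “moon” is an invented example of a pattern that does not occur:

```python
import re

# First line of the poem so the example is self-contained.
line = "Twinkle, twinkle, little star"

match = re.search(r"twinkle", line)
if match is not None:
    matched_text = match.group()   # the text that matched
    start = match.start()          # index of the first matched character
    end = match.end()              # index just past the last matched character
    print(matched_text, start, end)   # twinkle 9 16

# When nothing matches, re.search returns None rather than raising an error.
missing = re.search(r"moon", line)
print(missing)   # None
```

Checking for None before calling .group() avoids an AttributeError when the pattern is absent.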

re.split splits the text string at each match of the specified pattern. For this piece of text, if we split on all commas, it actually makes the poem much easier to read:

re.split(r",", text)
output: 
['Twinkle',
 ' twinkle',
 ' little star',
 ' How I wonder what you are! Up above the world so high',
 ' Like a diamond in the sky. When the blazing sun is gone',
 ' When he nothing shines upon',
 ' Then you show your little light',
 ' Twinkle',
 ' twinkle',
 ' all the night. Then the traveler in the dark Thanks you for your tiny spark',
 ' How could he see where to go',
 ' If you did not twinkle so? In the dark blue sky you keep',
 ' Often through my curtains peep For you never shut your eye',
 ' Till the sun is in the sky. As your bright and tiny spark Lights the traveler in the dark',
 ' Though I know not what you are',
 ' Twinkle',
 ' twinkle',
 ' little star.']
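
Every piece after the first keeps the space that followed the comma. One way to drop it, sketched here on the first two lines of the poem, is to split on a comma plus any white space that follows it. The maxsplit argument is shown as well, in case only the first few splits are wanted:

```python
import re

# First two lines of the poem so the example is self-contained.
line = "Twinkle, twinkle, little star, How I wonder what you are!"

# Split on a comma followed by zero or more white space characters.
clean_pieces = re.split(r",\s*", line)
# maxsplit=1 stops after the first split and leaves the rest intact.
first_split_only = re.split(r",\s*", line, maxsplit=1)

print(clean_pieces)       # ['Twinkle', 'twinkle', 'little star', 'How I wonder what you are!']
print(first_split_only)   # ['Twinkle', 'twinkle, little star, How I wonder what you are!']
```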

The re.sub function allows us to substitute parts of strings for other strings. How would the poem change if we substituted in the first three instances of ‘twinkle’ with ‘glimmer’?

re.sub(r"twinkle", "glimmer", text, count=3)
output: 'Twinkle, glimmer, little star, How I wonder what you are! Up above the world so high, Like a diamond in the sky. When the blazing sun is gone, When he nothing shines upon, Then you show your little light, Twinkle, glimmer, all the night. Then the traveler in the dark Thanks you for your tiny spark, How could he see where to go, If you did not glimmer so? In the dark blue sky you keep, Often through my curtains peep For you never shut your eye, Till the sun is in the sky. As your bright and tiny spark Lights the traveler in the dark, Though I know not what you are, Twinkle, twinkle, little star.'

Not bad.
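
A few variations on re.sub are sketched below, using a shortened excerpt that joins two lines of the poem. The count and flags arguments are passed by keyword, and re.subn is a related function in the re package that also reports how many replacements were made:

```python
import re

# Shortened excerpt: two lines of the poem joined for illustration.
line = "Twinkle, twinkle, little star. Twinkle, twinkle, all the night."

# Replace only the first lowercase match.
first_only = re.sub(r"twinkle", "glimmer", line, count=1)
# Ignore case so that "Twinkle" is replaced as well.
any_case = re.sub(r"twinkle", "glimmer", line, flags=re.IGNORECASE)
# re.subn returns a tuple of (new string, number of replacements).
replaced, how_many = re.subn(r"twinkle", "glimmer", line)

print(first_only)   # Twinkle, glimmer, little star. Twinkle, twinkle, all the night.
print(any_case)     # glimmer, glimmer, little star. glimmer, glimmer, all the night.
print(how_many)     # 2
```

Note that the case-insensitive version inserts the replacement exactly as written, so the capital letter at the start of each line is lost.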

I hope you enjoyed this introduction to RegEx with the re package in Python, but there is so much more to explore. Check out these sources that I used if you want to learn more: click here for more on Python RegEx and here for more on RegEx in general.
