# Intermediate | Regular Expressions In Python

## Escaping characters

Suppose you have a file of students and their grades on a recent exam.

Allison: C+
Bob: B-
Mary: C
Susan: A+
Ronald: F
Gerry: A+
Mark: C+


If you want to find all students who received an A+, you might try the regular expression A+, but this also matches the line Allison: C+.

Allison: C+  <- match
Bob: B-
Mary: C
Susan: A+    <- match
Ronald: F
Gerry: A+    <- match
Mark: C+


That's because + is a regular expression metacharacter meaning "one or more repetitions". In other words, A+ matches one or more consecutive A characters, so it matches the A in Allison, and in Susan's and Gerry's grades it matches only the A, not the plus sign.

To match a literal plus sign (overriding the default interpretation of +), you must escape it with a backslash. For example, A\+ matches the string A+.

Similarly, \\ matches a single backslash, \. matches a period, \( matches an open parenthesis, and so on.
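The difference between A+ and A\+ can be sketched in Python as follows. This is a minimal illustration, assuming the grades are held in a multi-line string; the variable names are made up for the example. The patterns use the r prefix, which is covered in the next section, and re.escape() is the standard-library helper that adds the backslashes for you.

```python
import re

# Sample data for illustration (assumed to mirror the grades file above).
grades = """Allison: C+
Bob: B-
Mary: C
Susan: A+
Ronald: F
Gerry: A+
Mark: C+"""

lines = grades.splitlines()

# Unescaped: + means "one or more A", so any line with a capital A matches.
unescaped = [line for line in lines if re.search(r"A+", line)]

# Escaped: \+ is a literal plus sign, so only lines containing "A+" match.
escaped = [line for line in lines if re.search(r"A\+", line)]

print(unescaped)  # ['Allison: C+', 'Susan: A+', 'Gerry: A+']
print(escaped)    # ['Susan: A+', 'Gerry: A+']

# re.escape() escapes metacharacters in a string automatically.
print(re.escape("A+"))  # A\+
```

re.escape() is mostly useful when the text to match comes from user input or another variable, so you cannot escape it by hand.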

## Raw string literals ('r' prefix)

Python doesn't always interpret strings exactly as you write them. For example, in the string "hello\nworld", Python interprets \n as a newline character. This becomes evident when you print() the string.

>>> print("hello\nworld")
hello
world


If you want to avoid this special interpretation of \n, you can prefix the string with the letter r.

>>> print(r"hello\nworld")
hello\nworld


A string prefixed with the letter r like this is called a raw string literal.
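One way to convince yourself of the difference is to compare the lengths of the two strings. The sketch below uses illustrative variable names; repr() shows the string the way Python would write it back as a literal.

```python
normal = "hello\nworld"   # \n is a single newline character
raw = r"hello\nworld"     # \n stays as two characters: a backslash and an n

print(len(normal))  # 11
print(len(raw))     # 12

# repr() makes the stored characters visible.
print(repr(normal))  # 'hello\nworld'
print(repr(raw))     # 'hello\\nworld'
```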

Tip

To see how Python interprets a string, just print() it.

>>> print("hello world")
hello world

>>> print("\hello world")  # \h is not a valid escape, so the backslash is kept; recent Python versions warn about this
\hello world

>>> print("\\hello world")
\hello world

>>> print("\\\\hello world")
\\hello world

>>> print("hello \"world\"")
hello "world"

>>> print("hello\nworld")
hello
world


Raw string literals vs escaping characters

Raw string literals are an alternative to escaping backslashes. For example, these two strings are equal.

r"hello\nworld" == "hello\\nworld"
# True
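This matters for regular expressions because the regex engine has its own backslash escapes on top of Python's. As a rough sketch, here is how a pattern that matches one literal backslash looks with and without the r prefix. The sample path is hypothetical and only serves as text containing backslashes.

```python
import re

# Hypothetical sample text containing two backslashes.
path = r"C:\temp\notes.txt"

# Both literals produce the same two-character regex: \\ (an escaped backslash).
raw_pattern = r"\\"
escaped_pattern = "\\\\"
print(raw_pattern == escaped_pattern)  # True

# The pattern matches each single backslash in the text.
matches = re.findall(raw_pattern, path)
print(len(matches))  # 2
```

Writing patterns as raw strings keeps them close to the regular expression you have in mind, which is why the r prefix is the usual convention for patterns passed to the re module.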


## Capture Groups

Capture groups let you select parts of a regular expression match. For example, given the string,

I went to the market and bought 12 eggs, 6 carrots, and 2 hams.


The expression "\d+ \w+" matches the three substrings shown in brackets:

I went to the market and bought [12 eggs], [6 carrots], and [2 hams].


By wrapping \d+ and \w+ in parentheses, giving (\d+) (\w+), each match gains two sub-matches. These are called "capture groups".

1   2
12 eggs

1   2
6 carrots

1  2
2 hams


In Python, if we identify the first match using re.search(),

import re

phrase = "I went to the market and bought 12 eggs, 6 carrots, and 2 hams."
first_match = re.search(pattern=r"(\d+) (\w+)", string=phrase)

print(first_match)
# <re.Match object; span=(32, 39), match='12 eggs'>


we can access its groups using the .groups() method,

first_match.groups()
# ('12', 'eggs')


or we can access each group individually using the .group() method

first_match.group(1)  # '12'
first_match.group(2)  # 'eggs'
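re.search() only returns the first match. To read the capture groups of every match, one option is re.findall() or re.finditer(), both part of the standard re module. The sketch below reuses the phrase from above; the shopping dictionary is just an illustrative way to use the two groups.

```python
import re

phrase = "I went to the market and bought 12 eggs, 6 carrots, and 2 hams."
pattern = r"(\d+) (\w+)"

# With two capture groups, findall returns one (group 1, group 2) tuple per match.
pairs = re.findall(pattern, phrase)
print(pairs)
# [('12', 'eggs'), ('6', 'carrots'), ('2', 'hams')]

# finditer yields match objects, so .group() and .groups() work on each match.
shopping = {m.group(2): int(m.group(1)) for m in re.finditer(pattern, phrase)}
print(shopping)
# {'eggs': 12, 'carrots': 6, 'hams': 2}
```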


### Nested and Non-Capturing Groups

Consider the following example, which matches Al or Bi (capitalized) followed by one or more word characters.

first_match = re.search(
    pattern=r"(Al|Bi)(\w+)",
    string="Amy loves Billy and hates Allen"
)
first_match.groups()
# ('Bi', 'lly')


There are two capture groups in this example: (Al|Bi) and (\w+). What if we wanted to make the entire pattern a single capture group? You might try

first_match = re.search(
    pattern=r"((Al|Bi)\w+)",
    string="Amy loves Billy and hates Allen"
)
first_match.groups()
# ('Billy', 'Bi')


but this returns two strings:

• one for the outer capture group, delimited by the outermost parentheses: ((Al|Bi)\w+)
• another for the inner capture group, delimited by the innermost parentheses: (Al|Bi).

This is known as a nested capture group.

The issue of selecting a single capture group remains! The problem stems from the alternation (or) operator |: we cannot drop the parentheses surrounding (Al|Bi) without changing the pattern's meaning.
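To see why the parentheses cannot simply be removed, here is a small sketch comparing the two patterns with re.findall(). Without the parentheses, | splits the whole pattern, so it means "Al" or "Bi followed by word characters".

```python
import re

text = "Amy loves Billy and hates Allen"

# Without parentheses the alternatives are "Al" and "Bi\w+".
without_parens = re.findall(r"Al|Bi\w+", text)
print(without_parens)  # ['Billy', 'Al']

# With parentheses the alternation is limited to the prefix,
# but findall now reports both capture groups for each match.
with_parens = re.findall(r"((Al|Bi)\w+)", text)
print(with_parens)  # [('Billy', 'Bi'), ('Allen', 'Al')]
```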

To get around this issue, we can change (Al|Bi) to (?:Al|Bi). The ?: marks the group as a non-capturing group: it still groups the alternatives, but its match is not returned as a separate group.

first_match = re.search(
    pattern=r"((?:Al|Bi)\w+)",
    string="Amy loves Billy and hates Allen"
)
first_match.groups()
# ('Billy',)
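The same idea carries over when collecting every match rather than just the first. The following sketch uses re.findall(), which returns plain strings when the pattern has at most one capture group. The variable names are illustrative.

```python
import re

text = "Amy loves Billy and hates Allen"

# One capturing group around a non-capturing alternation: one string per match.
single_group = re.findall(r"((?:Al|Bi)\w+)", text)
print(single_group)  # ['Billy', 'Allen']

# With no capturing groups at all, findall returns the full matches,
# so the outer parentheses are optional here.
no_groups = re.findall(r"(?:Al|Bi)\w+", text)
print(no_groups)  # ['Billy', 'Allen']
```

If you only ever need the full match, dropping the outer parentheses and reading .group(0) from a match object is an equally reasonable choice.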