X-Men Problem


Here's a quote from the movie X-Men:

quote = r"Dr. Grey, how do you feel about the Senator's Statement? Is there a mutant plot to overthrow the government?"

Given a dynamically generated list of words like the following:

searchwords = ['the', 'Dr.', 'to']

highlight each word by wrapping it in a <span></span> tag with class="highlight" as shown below.

Expected result
expected = r"""<span class="highlight">Dr.</span> Grey, how do you feel about <span class="highlight">the</span> Senator's Statement? Is there a mutant plot <span class="highlight">to</span> overthrow <span class="highlight">the</span> government?"""

Here, "the" should not match inside "there" and "to" should not match inside "Senator". Note also the non-word character "." at the end of "Dr." in one of the search words: a trailing \b does not match between "." and the space that follows it, so \b alone cannot anchor that word. Since the search words are not known at design time, you need to build the pattern dynamically, accounting for the fact that these words must be matched as whole words.
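One possible approach, sketched below, is to escape each search word with re.escape, join the words into a single alternation, and guard both ends with lookarounds instead of \b. The helper name highlight and its parameters are hypothetical, chosen only for this illustration; this is one way to solve the problem, not necessarily the intended solution.

```python
import re

quote = r"Dr. Grey, how do you feel about the Senator's Statement? Is there a mutant plot to overthrow the government?"
searchwords = ['the', 'Dr.', 'to']


def highlight(text, words):
    """Wrap whole-word occurrences of each word in a highlight span.

    Hypothetical helper written for this example.
    """
    if not words:
        return text  # an empty alternation would match everywhere

    # Escape each word so characters such as "." are matched literally.
    # Longer words go first so they win over any shorter word that is a prefix.
    escaped = [re.escape(w) for w in sorted(words, key=len, reverse=True)]

    # (?<!\w) and (?!\w) require that no word character touches the match.
    # Unlike \b, they also work when a search word ends in a non-word character.
    pattern = r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)"

    # A function replacement avoids any backslash handling in the matched text.
    return re.sub(
        pattern,
        lambda m: f'<span class="highlight">{m.group(0)}</span>',
        text,
    )


result = highlight(quote, searchwords)
print(result)
```

The lookarounds carry the "whole word" requirement: "the" inside "there" is rejected because a word character follows it, and "to" inside "Senator" is rejected because a word character precedes it.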

Regex Functions

Function                                         Description                                                Return Value
re.findall(pattern, string, flags=0)             Find all non-overlapping occurrences of pattern in string  list of strings, or list of tuples if > 1 capture group
re.finditer(pattern, string, flags=0)            Find all non-overlapping occurrences of pattern in string  iterator yielding match objects
re.search(pattern, string, flags=0)              Find first occurrence of pattern in string                 match object or None
re.split(pattern, string, maxsplit=0, flags=0)   Split string by occurrences of pattern                     list of strings
re.sub(pattern, repl, string, count=0, flags=0)  Replace occurrences of pattern with repl                   new string with the replacement(s)
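
As a quick illustration of the return values listed above, here is a small sketch that calls each function once. The sample string is taken from the quote; the variable names are made up for this example.

```python
import re

text = "Is there a mutant plot to overthrow the government?"

# findall: list of strings when the pattern has no capture groups
words_th = re.findall(r"th\w+", text)

# findall: list of tuples when the pattern has more than one capture group
pairs = re.findall(r"(t)(h\w+)", text)

# finditer: iterator of match objects, each carrying position information
spans = [m.span() for m in re.finditer(r"\bthe\b", text)]

# search: match object for the first occurrence, or None if there is none
first = re.search(r"m\w+", text)
missing = re.search(r"xyz", text)

# split: list of strings between occurrences of the pattern
pieces = re.split(r"\s+", "a mutant   plot")

# sub: new string with the replacement(s); the original string is unchanged
shouted = re.sub(r"\bthe\b", "THE", text)

print(words_th)       # ['there', 'throw', 'the']
print(pairs)          # [('t', 'here'), ('t', 'hrow'), ('t', 'he')]
print(spans)          # [(36, 39)]
print(first.group())  # mutant
print(missing)        # None
print(pieces)         # ['a', 'mutant', 'plot']
print(shouted)        # Is there a mutant plot to overthrow THE government?
```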

Regex Patterns

Pattern      Description
[abc]        a or b or c
[^abc]       not (a or b or c)
[a-z]        a or b ... or y or z
[1-9]        1 or 2 ... or 8 or 9
\d           digits [0-9]
\D           non-digits [^0-9]
\s           whitespace [ \t\n\r\f\v]
\S           non-whitespace [^ \t\n\r\f\v]
\w           word characters: alphanumeric or underscore [a-zA-Z0-9_]
\W           non-word characters [^a-zA-Z0-9_]
.            any character except a newline
x*           zero or more repetitions of x
x+           one or more repetitions of x
x?           zero or one repetitions of x
{m}          exactly m repetitions
{m,n}        m to n repetitions
\\, \., \*   literal backslash, period, asterisk
\b           word boundary
^hello       starts with hello
bye$         ends with bye
(...)        capture group
(po|go)      po or go
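
The following sketch exercises several of the patterns above on short, made-up strings, so each entry can be checked against actual behavior. The variable names are arbitrary and the sample strings are assumptions chosen for illustration.

```python
import re

# Character classes
in_set = re.findall(r"[abc]", "cabbage")       # only a, b, or c
not_in_set = re.findall(r"[^abc]", "cabbage")  # everything else

# Shorthand classes and quantifiers
numbers = re.findall(r"\d+", "X-Men (2000) and X2 (2003)")
years = re.findall(r"\d{4}", "X-Men (2000) and X2 (2003)")
spellings = re.findall(r"colou?r", "color colour")
goods = re.findall(r"go{2,3}d", "god good goood gooood")

# The dot does not match a newline by default
dotted = re.findall(r"a.c", "abc a\nc")

# Escaped period, and the word-boundary pitfall after a non-word character
literal_dot = re.findall(r"Dr\.", "Dr. Grey and Drs")
boundary_ok = re.findall(r"\bthe\b", "Is there the end")
boundary_fail = re.search(r"\bDr\.\b", "Dr. Grey")  # no boundary between "." and " "

# Anchors, capture groups, alternation
starts = re.search(r"^hello", "hello world") is not None
ends = re.search(r"bye$", "goodbye") is not None
surname = re.search(r"Dr\. (\w+)", "Dr. Grey").group(1)
either = re.findall(r"(po|go)", "pogo")

print(in_set, not_in_set)
print(numbers, years, spellings, goods)
print(dotted, literal_dot, boundary_ok, boundary_fail)
print(starts, ends, surname, either)
```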
