Python Regex Backreferences

Summary: in this tutorial, you’ll learn about Python regex backreferences and how to apply them effectively.

Introduction to the Python regex backreferences

Backreferences are like variables in Python: they let you refer back to the text captured by an earlier capturing group within the same regular expression.

The following shows the syntax of a backreference:

\N

Alternatively, in the replacement string passed to sub(), you can use the following syntax:

\g<N>

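In both forms, N is the number of the capturing group (1, 2, 3, and so on), counted by the position of its opening parenthesis.

Note that \g<0> refers to the entire match, the same text that match.group(0) returns. Python accepts the \g<N> form only in replacement strings, not inside a pattern.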
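
As a minimal sketch of the \g<N> form in replacement strings, the snippet below uses re.sub() with a few made-up inputs. The variable names (dedup, padded, wrapped) and the sample strings '7-' and 'hi there' are illustrative assumptions, not part of the tutorial's own examples.

```python
import re

s = 'Python Python is awesome'

# In a replacement string, \g<1> inserts group 1, just like \1.
dedup = re.sub(r'(\w+)\s+\1', r'\g<1>', s)
print(dedup)    # Python is awesome

# \g<1> avoids ambiguity when a digit follows the reference:
# r'\10' would mean group 10, while r'\g<1>0' means group 1 followed by "0".
padded = re.sub(r'(\d)-', r'\g<1>0', '7-')
print(padded)   # 70

# \g<0> inserts the entire match, the same text as match.group(0).
wrapped = re.sub(r'\w+', r'[\g<0>]', 'hi there')
print(wrapped)  # [hi] [there]
```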

Suppose you have a string with the duplicate word Python like this:

s = 'Python Python is awesome'

And you want to remove the duplicate word (Python) so that the result string will be:

Python is awesome

To do that, you can use a regular expression with a backreference.

First, match a word of one or more word characters followed by one or more whitespace characters:

r'\w+\s+'

Second, create a capturing group that contains only the word characters:

r'(\w+)\s+'

Third, create a backreference that references the first capturing group:

r'(\w+)\s+\1'

In this pattern, \1 is a backreference to the (\w+) capturing group, so it matches only the exact text that the group captured.
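
To see what the pattern matches before doing any replacement, the following sketch runs re.search() on the sample string and inspects the groups. The variable names (whole, first, unraw_is_control_char) are illustrative. It also checks why a raw string is a safe choice here: in an ordinary string literal, '\1' is an octal escape for the character with code 1, not a backreference.

```python
import re

s = 'Python Python is awesome'

match = re.search(r'(\w+)\s+\1', s)

whole = match.group(0)  # text matched by the whole pattern
first = match.group(1)  # text captured by (\w+)

print(whole)  # Python Python
print(first)  # Python

# Without the r prefix, '\1' is the single character chr(1).
unraw_is_control_char = '\1' == '\x01'
print(unraw_is_control_char)  # True
```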

Finally, replace the entire match with the first capturing group using the sub() function from the re module:

import re

s = 'Python Python is awesome'

new_s = re.sub(r'(\w+)\s+\1', r'\1', s)

print(new_s)

Output:

Python is awesome

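One caveat worth sketching: the pattern above has no word boundaries, so it can also match the tail of one word against the start of the next. The variant below is one possible way to guard against that, not the tutorial's own solution. It adds \b anchors and a repeated non-capturing group so that runs of three or more copies collapse too. The helper name dedupe_words and the extra sample strings are hypothetical.

```python
import re

def dedupe_words(text):
    # \b keeps the match on whole words;
    # (?:\s+\1\b)+ absorbs any number of repeats of the captured word.
    return re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)

# The pattern without boundaries treats "is is" inside "this is" as a duplicate.
naive = re.sub(r'(\w+)\s+\1', r'\1', 'this is fine')
print(naive)                                     # this fine

print(dedupe_words('this is fine'))              # this is fine
print(dedupe_words('Python Python is awesome'))  # Python is awesome
print(dedupe_words('so so so good'))             # so good
```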

More Python regex backreference examples

Let's look at some more examples of using backreferences.

1) Using Python regex backreferences to get text inside quotes

Suppose you want to get the text within double quotes:

"This is regex backreference example"

Or within single quotes:

'This is regex backreference example'

But not text delimited by a mix of single and double quotes. The following should not match:

'not match"

To do this, you might first try the following pattern:

'[\'"](.*?)[\'"]'

However, this pattern also matches text that starts with a single quote (') and ends with a double quote ("), or vice versa. For example:

import re

s = '"Python\'s awesome". She said'
pattern = '[\'"].*?[\'"]'

match = re.search(pattern, s)

print(match.group(0))

It returns "Python' rather than "Python's awesome", because the match ends at the first quote character of either kind:

"Python'

To fix it, you can use a backreference:

r'([\'"]).*?\1'

The backreference \1 refers to the first capturing group, which holds the opening quote. If the text starts with a single quote, \1 matches only a single quote; if it starts with a double quote, \1 matches only a double quote.

For example:

import re

s = '"Python\'s awesome". She said'
pattern = r'([\'"])(.*?)\1'

match = re.search(pattern, s)
print(match.group())

Output:

"Python's awesome"

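To pull out only the text between the quotes, and to handle more than one quoted span, a sketch along these lines may help. It reads the second group from each match returned by re.finditer(). The longer sample string and the names quoted and quoted_named are assumptions made for illustration. The second pattern shows the same idea with a named group, (?P<quote>...), and its named backreference, (?P=quote).

```python
import re

s = '"Python\'s awesome", she said, then added \'so is "regex"\'.'

# Group 1 is the opening quote, group 2 is the text between the quotes.
pattern = r'([\'"])(.*?)\1'
quoted = [m.group(2) for m in re.finditer(pattern, s)]
print(quoted)  # ["Python's awesome", 'so is "regex"']

# The same pattern with a named group and a named backreference.
named = r'(?P<quote>[\'"])(?P<text>.*?)(?P=quote)'
quoted_named = [m.group('text') for m in re.finditer(named, s)]
print(quoted_named == quoted)  # True
```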

2) Using Python regex backreferences to find words that have at least one consecutive repeated character

The following example uses a backreference to find words that have at least one consecutive repeated character:

import re

words = ['apple', 'orange', 'strawberry']
pattern = r'\b\w*(\w)\1\w*\b'

results = [w for w in words if re.search(pattern, w)]

print(results)

Output:

['apple', 'strawberry']

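If you also want to know which characters are doubled, a shorter pattern is enough, since (\w)\1 alone finds each doubled character. The sketch below is an illustration rather than part of the original example. The extra word 'bookkeeper' and the name doubles are assumptions added to show several repeats in one word.

```python
import re

words = ['apple', 'orange', 'strawberry', 'bookkeeper']

# (\w) captures one word character; \1 requires the same character again.
pattern = r'(\w)\1'

doubles = {w: [m.group(0) for m in re.finditer(pattern, w)] for w in words}
print(doubles)
# {'apple': ['pp'], 'orange': [], 'strawberry': ['rr'], 'bookkeeper': ['oo', 'kk', 'ee']}
```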

Summary

  • Use a backreference \N to reference the capturing group N in a regular expression.
