Regex Greedy

Summary: in this tutorial, you’ll learn about the regex greedy mode and how it affects the way the quantifiers search for matches.

The problem with the regex greedy mode

Suppose you have the following string:

<a href="/" title="Go to the homepage">Home</a>


And you want to match the text within the quotes (""). To do that, you can use the following pattern that includes the dot (.) character class and the (+) quantifier:

".+"


The meaning of the pattern is as follows:

  • " matches the opening quote
  • . matches any character except the newline
  • + repeats the preceding element (any character) one or more times
  • " matches the closing quote

The following example uses the preg_match_all() function to search the string for matches of the pattern:

<?php

$str = '<a href="/" title="Go to the homepage">Home</a>';
$pattern = '/".+"/';

if (preg_match_all($pattern, $str, $matches)) {
    // $matches[0] holds the full-pattern matches
    print_r($matches[0]);
}


It returns the following:

Array
(
    [0] => "/" title="Go to the homepage"
)


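This is not the expected result. Instead of two separate quoted values, the pattern returns a single match that runs from the first quote to the last one.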

The reason is that the quantifier (+) uses the greedy mode by default. In the greedy mode, the quantifier (+) tries to match its preceding element (a character) as many times as possible.
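As a small illustration of "as many times as possible", here is a minimal sketch that uses an input string made up for this purpose (it is not part of the tutorial's example). The \d+ pattern could stop after a single digit, but the greedy quantifier keeps consuming digits for as long as it can:

```php
<?php

// Illustrative input, not part of the tutorial's example.
$subject = 'Order 12345 shipped';

// \d+ is greedy: it keeps consuming digits for as long as it can.
preg_match('/\d+/', $subject, $match);

echo $match[0], PHP_EOL; // 12345
```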

Let’s understand how the regex greedy mode works.

Understand how the regex greedy mode works

To match $str against $pattern, the regex engine tries the pattern at each position in the string, starting from the first one.

The regex engine starts at the first character in $str. Since this character is <, which does not match the quote (") in the pattern, the regex engine moves forward until it reaches the first quote (") in the string:

[Figure: regex greedy, start matching]

The regex engine then moves to the next rule in the pattern, .+. Because .+ matches any character one or more times and is greedy, the regex engine consumes every character until it reaches the end of the string:

[Figure: regex greedy]

The regex engine examines the last rule in the pattern, which is a quote (").

However, it has already reached the end of the string, so there are no more characters to match. The quantifier was too greedy and went too far.

Therefore, the regex engine steps back from the end of the string, one character at a time, until it finds a quote ("). This is called backtracking:

[Figure: regex greedy backtracking]

As a result, the match is the following substring, which is not what you expected:

"/" title="Go to the homepage"

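One way to check the outcome of the walkthrough above is to ask preg_match() for offsets. The sketch below is only an illustration: it uses the PREG_OFFSET_CAPTURE flag to get the position of the match and compares it with the positions of the first and last quotes reported by strpos() and strrpos(). The variable names are chosen for this example.

```php
<?php

$str = '<a href="/" title="Go to the homepage">Home</a>';

// PREG_OFFSET_CAPTURE returns each match as [text, byte offset].
preg_match('/".+"/', $str, $m, PREG_OFFSET_CAPTURE);

[$text, $start] = $m[0];
$end = $start + strlen($text) - 1;

// The greedy match starts at the first quote and ends at the last quote.
var_dump($start === strpos($str, '"')); // bool(true)
var_dump($end === strrpos($str, '"'));  // bool(true)
```

Both comparisons are true because .+ first runs to the end of the string and then backtracks only as far as the nearest quote, which is the last quote in the string.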

Other quantifiers, such as * and ?, are also greedy by default. To fix this issue, switch the quantifier from the greedy mode to the non-greedy (or lazy) mode by adding a question mark (?) after it, like this:

".+?"


The following code returns the expected result:

<?php

$str = '<a href="/" title="Go to the homepage">Home</a>';
$pattern = '/".+?"/';

if (preg_match_all($pattern, $str, $matches)) {
    print_r($matches[0]);
}


Output:

Array
(
    [0] => "/"
    [1] => "Go to the homepage"
)

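The same contrast applies to the other quantifiers. As a rough sketch, the following code runs each greedy quantifier and its lazy counterpart against a made-up input string (not part of the tutorial's example) and prints what each one matches:

```php
<?php

// Illustrative input, not part of the tutorial's example.
$subject = 'aaaa';

// Each greedy quantifier is followed by its lazy counterpart (quantifier + ?).
$patterns = ['/a*/', '/a*?/', '/a?/', '/a??/', '/a{2,4}/', '/a{2,4}?/'];

$results = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, $subject, $m);
    $results[$pattern] = $m[0];
    printf("%-10s => '%s'\n", $pattern, $m[0]);
}
```

The greedy patterns a*, a? and a{2,4} match 'aaaa', 'a' and 'aaaa', while their lazy forms match the empty string, the empty string and 'aa', which is the minimum each quantifier allows.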

To understand how the regex non-greedy mode works, check out the regex non-greedy tutorial.

Summary

  • By default, quantifiers use the greedy mode.
  • A greedy quantifier tries to match its preceding element as many times as possible.
