
How to make regex match non-greedy?

I have read in my programming book that .*? will usually make the regex not be greedy, and instead match the shortest possible match.

However, it isn't working as desired for the following:

regular expression: http.*?500.jpg

test string: http://google.com<img src="http://33.google.com/image/500.jpg

I want to match only the shortest, which is: http://33.google.com/image/500.jpg.

But it doesn't. It matches the entire string...

I've tried reading more on regex, however, I haven't been able to work it out.

How can I select only the shortest match, as in this example?

asked Jun 13 '14 22:06 by BBedit


3 Answers

I know there are two answers already, but sometimes it helps to have another way to look at it and handle it.

The Problem

When the engine is positioned before the first h, it makes its best effort to match the regex http.*?500.jpg. Can the regex match at that point? Yes, it can. After matching http, the engine keeps lazily matching until it meets 500.jpg. There is nothing to stop it. You have told it to match only as many chars as necessary, and that is what it is doing.

In contrast, suppose you have this string, which contains 500.jpg twice:

http://google.com<img src="http://google.com/500.jpg 1500.jpg 
                                                    ^ lazy .*? stops here
                                                             ^ greedy .* stops here

The greedy one will match the whole string. But the lazy one will stop as soon as it can: in the same place as before. This is where you can see the difference between greedy and lazy.
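To see this concretely, here is a minimal Python sketch (assuming Python's re module; the variable names are only for illustration, and the dot before jpg is escaped, which the original pattern did not do). It runs the lazy and greedy patterns against both strings above. Both patterns start at the first http; only the end of the match differs.

```python
import re

# The test string from the question and the two-"500.jpg" string from above.
original = 'http://google.com<img src="http://33.google.com/image/500.jpg'
two_hits = 'http://google.com<img src="http://google.com/500.jpg 1500.jpg '

lazy = re.compile(r'http.*?500\.jpg')    # lazy: stop at the first 500.jpg
greedy = re.compile(r'http.*500\.jpg')   # greedy: run to the last 500.jpg

lazy_original = lazy.search(original)
lazy_two = lazy.search(two_hits)
greedy_two = greedy.search(two_hits)

# The lazy match on the question's string covers the whole string,
# because the match begins at the first "http" it can succeed from.
print(lazy_original.span())

# On the second string, lazy and greedy start together but end differently.
print(lazy_two.group())
print(greedy_two.group())
```

Laziness only affects where the match ends; it never moves the starting point to the right.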

Workaround: Don't Use Dot-Star—Use The Right Token

Suppose you knew that each http string has a space or newline after it. You could use a lazy match with http\S*?\.jpg. The point is that \S*?, which matches any run of characters that are not whitespace (spaces, newlines, tabs, etc.), cannot jump over the space, unlike the dot-star. In your test string, an attempt starting at the first http therefore fails at the space after <img, and the engine moves on to the second http.
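A quick Python check of this pattern, as a sketch under the stated assumption that whitespace ends each URL. The second string and its example hosts are made up for illustration.

```python
import re

pattern = re.compile(r'http\S*?\.jpg')

# The question's test string: the first "http" run ends at the space
# after "<img" without reaching ".jpg", so that attempt fails.
text = 'http://google.com<img src="http://33.google.com/image/500.jpg'
first = pattern.search(text)
print(first.group())

# Hypothetical string with two whitespace-separated URLs:
# each URL is matched on its own.
spaced = 'see http://a.example/x.jpg and http://b.example/y.jpg'
urls = pattern.findall(spaced)
print(urls)
```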

Reference

In addition, I highly recommend you read the article below, as it should help with any remaining confusion.

The Many Degrees of Regex Greed

answered Oct 09 '22 18:10 by zx81


http matches as early as possible, then .*? matches as few characters as possible, giving you a longer-than-necessary string.

You can instead make sure http matches as late as possible by adding a greedy .* before it:

import re
text = 'http://google.com<img src="http://33.google.com/image/500.jpg'
# The greedy .* consumes as much as it can, so the group starts at the last "http"
print(re.match(r'.*(http.*?500\.jpg)', text).group(1))

answered Oct 09 '22 17:10 by that other guy


The regex engine processes the string character by character from left to right. Thus, when the first http is found, the regex engine tries to make the pattern succeed with as few characters as possible, but starting from the current position (in other words: as early as possible in the string).

With your example, to be sure to match the URL that ends with 500.jpg, you can help the regex engine find what you want by giving it more information, for example:

\bhttp://\S+/500\.jpg\b

Information added:

  • use of word boundaries \b
  • http:// to be more explicit
  • \S+ uses the fact that there are no spaces in a URL (spaces are generally converted to %20)
  • the slash before the file name

Note: as you can see, when you add more information to a pattern, lazy quantifiers often become unnecessary.

This is only an example that fits your excerpt; you need to adapt it to your situation. For instance, if the string contains URLs separated by commas, replace \S with [^\s,].
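Both variants can be tried in Python with the sketch below. The comma-separated string and its example hosts are hypothetical, added only to show why \S needs to be narrowed in that case.

```python
import re

text = 'http://google.com<img src="http://33.google.com/image/500.jpg'

# The more explicit pattern from above: no lazy quantifier needed.
strict = re.compile(r'\bhttp://\S+/500\.jpg\b')
strict_match = strict.search(text)
print(strict_match.group())

# Hypothetical comma-separated list: \S+ happily crosses the commas,
# so the match starts at the first URL.
comma_separated = ('http://a.example/1.png,'
                   'http://b.example/image/500.jpg,'
                   'http://c.example/2.png')
too_long = strict.search(comma_separated)
print(too_long.group())

# Excluding commas as well as whitespace keeps the match inside one URL.
comma_aware = re.compile(r'\bhttp://[^\s,]+/500\.jpg\b')
just_one = comma_aware.search(comma_separated)
print(just_one.group())
```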

answered Oct 09 '22 17:10 by Casimir et Hippolyte