
Why is this regex being greedy?

I am trying to extract all links that contain /thumbs/ and are enclosed in double quotes. Actually, I only need the images' src values. I don't know whether the images will end with jpg or whether there will be case-sensitivity problems, etc. I really only care about the full link.

    m = Regex.Match(page, @"""(.+?/thumbs/.+?)""");
    //...
    var thumbUrl = m.Groups[1].Value;

My full code

    var page = DownloadWebPage(url);
    var reg = new Regex(@"Elements\s+\((.*)\)", RegexOptions.Multiline);
    var m = reg.Match(page);
    var szEleCount= m.Groups[1].Value;
    int eleCount = int.Parse(szEleCount);

    m = Regex.Match(page, @"""(.+?/thumbs/.+?)""");
    while (m.Success)
    {
        var thumbUrl = m.Groups[1].Value;
        // I set a breakpoint here to inspect thumbUrl
        m = m.NextMatch();
    }

thumbUrl looks like

center\"> ... lot of text, no /thumbs/ ... src=\"http://images.fdhkdhfkd.com/thumbs/dfljdkl/22350.jpg


1 Answer

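The lazy quantifier is not being greedy: a match always begins at the leftmost position where one is possible, so it starts at the first double quote on the page, and .+? then expands past later quotes until it finds /thumbs/. Non-greedy regular expressions can also be slow because the engine has to do a lot of backtracking.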

This one uses only greedy expressions:

    @"""([^""]*/thumbs/[^""]*)"""

Instead of matching as little of anything as possible, it matches as many non-double-quote characters as it can, so a match can never extend past a double quote.
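
To see the difference side by side, here is a minimal sketch that runs both patterns over the same input. The class name ThumbLinks, the helper Extract, and the sample markup are made up for illustration and are not from the original page; only the two patterns come from the question and the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ThumbLinks
{
    // Lazy pattern from the question: the match starts at the first quote
    // found, and .+? expands across later quotes until it reaches /thumbs/.
    public const string LazyPattern = @"""(.+?/thumbs/.+?)""";

    // Negated-character-class pattern from the answer: [^""]* cannot cross
    // a double quote, so each match stays inside one quoted value.
    public const string QuoteBoundPattern = @"""([^""]*/thumbs/[^""]*)""";

    // Hypothetical helper: returns capture group 1 of every match.
    public static List<string> Extract(string page, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(page, pattern))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Assumed sample markup, standing in for the downloaded page.
        string page = "<div class=\"center\"><img src=\"http://example.com/thumbs/a/1.jpg\"></div>";

        foreach (var url in Extract(page, LazyPattern))
            Console.WriteLine("lazy:        " + url);

        foreach (var url in Extract(page, QuoteBoundPattern))
            Console.WriteLine("quote-bound: " + url);
    }
}
```

On this sample, the lazy pattern should capture everything from center up to the end of the image URL, while the quote-bound pattern should capture only the URL itself.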

Andomar answered Mar 29 '26 22:03