
Python regex: Difference between (.+) and (.+?)

I am new to regex and Python's urllib. I went through an online tutorial on web scraping and it had the following code. After studying up on regular expressions, it seemed to me that I could use (.+) instead of (.+?) in my regex, but whoa, was I wrong. I ended up printing far more HTML than I wanted. I thought I was getting the hang of regex, but now I am confused. Please explain the difference between these two expressions and why (.+) grabs so much HTML. Thanks!

P.S. This is a Starbucks stock quote scraper.

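import re
import urllib.request

# Fetch the quote page (Python 3 form of the original Python 2 urllib.urlopen call)
url = urllib.request.urlopen("http://finance.yahoo.com/q?s=SBUX")
# read() returns bytes in Python 3, so decode before matching a str pattern
htmltext = url.read().decode("utf-8", errors="replace")
# Lazy group (.+?): capture only up to the first </span>
regex = re.compile('<span id="yfs_l84_sbux">(.+?)</span>')
found = regex.findall(htmltext)

print(found)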

user3739620 asked Dec 20 '22 13:12


1 Answer

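.+ is greedy -- it matches as much as it can, then backtracks ("gives back" characters) only as far as needed for the rest of the pattern to match.

.+? is lazy (non-greedy) -- it matches as little as it can and stops at the first point where the rest of the pattern matches.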
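To make the two definitions concrete, here is a minimal sketch on a made-up string. The string and the variable names are illustrative assumptions, not part of the question's code.

```python
import re

# Hypothetical sample text with two <b>...</b> pairs on one line
text = "<b>one</b> and <b>two</b>"

# Greedy: .+ runs to the end, then backs up to the LAST </b>
greedy = re.findall(r"<b>(.+)</b>", text)

# Lazy: .+? expands one character at a time and stops at the FIRST </b>
lazy = re.findall(r"<b>(.+?)</b>", text)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```

The greedy pattern finds a single match that swallows the middle of the string, while the lazy pattern finds each pair separately.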

Examples:

Assume you have this HTML:

<span id="yfs_l84_sbux">foo bar</span><span id="yfs_l84_sbux2">foo bar</span>

This regex matches the whole thing:

<span id="yfs_l84_sbux">(.+)<\/span>

The .+ first runs all the way to the end of the input, then "gives back" characters one at a time until the rest of the regex can match. That happens as soon as it has given back the final </span>, so the complete regex matches the entire HTML chunk.

But this regex stops at the first </span>:

<span id="yfs_l84_sbux">(.+?)<\/span>
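Both patterns can be checked in Python against the sample HTML above. This is a small sketch, assuming the HTML sits in a variable named html (a name chosen here for illustration). The slash is left unescaped, which is equivalent in Python's re module.

```python
import re

# The sample HTML from this answer, split across two literals for readability
html = ('<span id="yfs_l84_sbux">foo bar</span>'
        '<span id="yfs_l84_sbux2">foo bar</span>')

greedy = re.search(r'<span id="yfs_l84_sbux">(.+)</span>', html)
lazy = re.search(r'<span id="yfs_l84_sbux">(.+?)</span>', html)

# Greedy group runs up to the last </span>
print(greedy.group(1))  # foo bar</span><span id="yfs_l84_sbux2">foo bar

# Lazy group stops at the first </span>
print(lazy.group(1))    # foo bar
```

By default . does not match a newline, so on a real page the greedy version runs to the last </span> on the same line. That is why swapping (.+?) for (.+) can return a large block of unrelated markup.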
elixenide answered Jan 01 '23 20:01