I am new to regex and Python's urllib. I went through an online tutorial on web scraping and it had the following code. After studying up on regular expressions, it seemed to me that I could use (.+) instead of the (.+?) in my regex, but whoa was I wrong. I ended up printing way more html code than I wanted. I thought I was getting the hang of regex, but now I am confused. Please explain to me the difference between these two expressions and why it is grabbing so much html. Thanks!
P.S. This is a Starbucks stock quote scraper.
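import urllib
import re

# Fetch the quote page (Python 2 style: urllib.urlopen and the print statement).
url = urllib.urlopen("http://finance.yahoo.com/q?s=SBUX")
htmltext = url.read()

# Capture the text inside the quote span; (.+?) is the lazy form.
regex = re.compile(r'<span id="yfs_l84_sbux">(.+?)</span>')
found = regex.findall(htmltext)
print found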
.+ is greedy: it matches as much as it can, then gives back only as much as needed for the rest of the pattern to match.
.+? is not (it is lazy): it stops at the first opportunity.
Examples:
Assume you have this HTML:
<span id="yfs_l84_sbux">foo bar</span><span id="yfs_l84_sbux2">foo bar</span>
This regex matches the whole thing:
<span id="yfs_l84_sbux">(.+)<\/span>
The (.+) goes all the way to the end, then "gives back" just the final </span> so that the rest of the regex can match it. As a result, the complete regex matches the entire HTML chunk.
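To see where the greedy match ends up, here is a small Python 3 sketch run against the sample HTML above. The variable names are only illustrative and are not part of the original scraper.

```python
import re

# Sample HTML from the example above: two spans on a single line.
html = ('<span id="yfs_l84_sbux">foo bar</span>'
        '<span id="yfs_l84_sbux2">foo bar</span>')

greedy = re.compile(r'<span id="yfs_l84_sbux">(.+)</span>')
m = greedy.search(html)

# The overall match covers the string from its first character to its last.
print(m.span() == (0, len(html)))

# Group 1 swallowed the first </span> and the whole second opening tag.
print(m.group(1))
```

The captured group is everything between the first opening tag and the last </span>, which is why the greedy version prints so much extra markup.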
But this regex stops at the first </span>:
<span id="yfs_l84_sbux">(.+?)<\/span>
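A side-by-side check of the two patterns, written as a Python 3 sketch against the sample HTML rather than the live page. The third pattern, using the negated character class [^<]+, is an alternative I am adding for comparison. It is not from the tutorial and assumes the quote text contains no nested tags.

```python
import re

# Sample HTML from the example above: two spans on a single line.
html = ('<span id="yfs_l84_sbux">foo bar</span>'
        '<span id="yfs_l84_sbux2">foo bar</span>')

# Lazy: stop at the first </span> that lets the pattern match.
lazy = re.findall(r'<span id="yfs_l84_sbux">(.+?)</span>', html)

# Greedy: run to the end, then back off to the last </span>.
greedy = re.findall(r'<span id="yfs_l84_sbux">(.+)</span>', html)

# Alternative (assumption: no nested tags): match anything except "<".
no_tags = re.findall(r'<span id="yfs_l84_sbux">([^<]+)</span>', html)

print(lazy)     # ['foo bar']
print(greedy)   # ['foo bar</span><span id="yfs_l84_sbux2">foo bar']
print(no_tags)  # ['foo bar']
```

On a real page, keep in mind that . does not match a newline by default in Python, so the greedy version runs to the last </span> on the same line as the opening tag. That is usually still far more than the quote itself.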