I thought I would write some quick code to download the number of "fans" a Facebook page has.
For some reason, despite a fair number of iterations I've tried, I can't get the following code to pick out the number of fans in the HTML. None of the other solutions I found on the web correctly match the regex in this case either. Surely it is possible to have some wildcard between the two matching bits?
The text I'd like to match against is "6 of X fans", where X is an arbitrary number of fans a page has - I would like to get this number.
I was thinking of polling this data intermittently and writing to a file but I haven't gotten around to that yet. I'm also wondering if this is headed in the right direction, as the code seems pretty clunky. :)
import urllib
import re
fbhandle = urllib.urlopen('http://www.facebook.com/Microsoft')
pattern = "6 of(.*)fans" #this wild card doesnt appear to work?
compiled = re.compile(pattern)
for lines in fbhandle.readlines():
    ms = compiled.match(lines)
    print ms #debugging
    if ms: break
#ms.group()
print ms
fbhandle.close()
import urllib
import re
fbhandle = urllib.urlopen('http://www.facebook.com/Microsoft')
pattern = "6 of(.*)fans" #this wild card doesnt appear to work?
compiled = re.compile(pattern)
ms = compiled.search(fbhandle.read())
print ms.group(1).strip()
fbhandle.close()
You need to use re.search() instead. re.match() only succeeds if the pattern matches at the very beginning of the string, but you're trying to find a piece somewhere inside the document. The code above prints: 79,110. Of course, this will probably be a different number by the time someone else runs it.
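To see the difference without hitting the network, here is a minimal sketch. The page string is a made-up stand-in for the downloaded HTML (the real markup will differ), and it uses Python 3 print() syntax rather than the Python 2 style of the code above.

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup will differ.
page = '<html><body><div class="fans">6 of 79,110 fans</div></body></html>'

compiled = re.compile(r"6 of(.*)fans")

# match() only succeeds if the pattern fits starting at position 0 of the string.
at_start = compiled.match(page)

# search() scans forward and returns the first place where the pattern fits.
anywhere = compiled.search(page)

print(at_start)                    # None
print(anywhere.group(1).strip())   # 79,110
```

The same reasoning explains why the line-by-line loop in the question found nothing: match() was anchored to the start of every line, and no line began with "6 of".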
Evan Fosmark already gave a good answer. This is just more info.
You have this line:
pattern = "6 of(.*)fans"
In general, this isn't a good regular expression. If the input text was:
"6 of 99 fans in the whole galaxy of fans"
Then the match group (the stuff inside the parentheses) would be:
" 99 fans in the whole galaxy of "
So, we want a pattern that will just grab what you want, even with a silly input text like the above.
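A quick way to check this for yourself (a small sketch in Python 3 syntax, using only the silly input text from above):

```python
import re

text = "6 of 99 fans in the whole galaxy of fans"

# Greedy .* runs to the end of the text, then backs up only as far as
# the last "fans" it can find.
greedy = re.search(r"6 of(.*)fans", text)

print(repr(greedy.group(1)))  # ' 99 fans in the whole galaxy of '
```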
In this case, it doesn't really matter if you match the white space, because when you convert a string to an integer with int(), leading and trailing white space is ignored. But let's write the pattern to ignore white space.
With the * wildcard, it is possible to match a string of length zero. In this case I think you always want a non-empty match, so you want to use + to match one or more characters.
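Here is a small sketch of that difference. The input is contrived (nothing at all sits between "6 of" and "fans"), purely to show what each quantifier accepts:

```python
import re

text = "6 offans"  # contrived input with nothing between the two anchors

star = re.search(r"6 of(.*)fans", text)
plus = re.search(r"6 of(.+)fans", text)

print(repr(star.group(1)))  # '' (it matched, but the group is empty)
print(plus)                 # None (+ insists on at least one character)
```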
Python has non-greedy matching available, so you could rewrite with that. Older programs with regular expressions may not have non-greedy matching, so I'll also give a pattern that doesn't require non-greedy.
So, the non-greedy pattern:
pattern = r"6 of\s+(.+?)\s+fans"  # raw string, so the backslashes reach the regex engine untouched
The other one:
pattern = r"6 of\s+(\S+)\s+fans"  # raw string, so the backslashes reach the regex engine untouched
\s means "any white space" and will match a space, a tab, and a few other characters (such as "form feed"). \S means "any non-white-space" and matches anything that \s would not match.
The non-greedy pattern does better than your original pattern with the silly input text:
"6 of 99 fans in the whole galaxy of fans"
It would return a match group of just 99.
But try this other silly input text:
"6 of 99 crazed fans"
It would return a match group of 99 crazed.
The other pattern (the one using \S+) would not match at all, because "99" is followed by the word "crazed" rather than the word "fans".
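If you want to try both patterns against both silly inputs, something like the following sketch works. The helper first_group is just a name I made up for this illustration, and the code uses Python 3 print() syntax.

```python
import re

non_greedy = re.compile(r"6 of\s+(.+?)\s+fans")
no_whitespace = re.compile(r"6 of\s+(\S+)\s+fans")

samples = [
    "6 of 99 fans in the whole galaxy of fans",
    "6 of 99 crazed fans",
]

def first_group(compiled, text):
    """Return group 1 of the first match, or None when nothing matches."""
    found = compiled.search(text)
    return found.group(1) if found else None

for text in samples:
    print(text)
    print("  non-greedy pattern:", first_group(non_greedy, text))
    print("  non-space pattern: ", first_group(no_whitespace, text))
```

Returning None instead of calling group() on a failed match also avoids the AttributeError you would otherwise get when the page doesn't contain the text at all.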
Hmm. Here's one last pattern that should always do the right thing even with silly input texts:
pattern = r"6 of\D*?(\d+)\D*?fans"  # raw string, so the backslashes reach the regex engine untouched
\d matches any digit ('0' to '9'). \D matches any non-digit.
This will successfully match anything that is remotely non-ambiguous:
"6 of 99 fans in the whole galaxy of fans"
The match group will be 99.
"6 of 99 crazed fans"
The match group will be 99.
"6 of 99 41 fans"
It will not match, because there was a second number in there.
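The three cases can be checked with a short sketch like this one (Python 3 syntax). One caveat: the number printed earlier in this thread, 79,110, contains a comma, and a comma is a non-digit sitting between two runs of digits, so the strict pattern refuses it just as it refuses "99 41". The with_commas variant below is my own assumption about how you might allow for that; it is not part of the patterns discussed above.

```python
import re

strict = re.compile(r"6 of\D*?(\d+)\D*?fans")

# Assumption for illustration only: accept commas inside the number.
with_commas = re.compile(r"6 of\D*?(\d[\d,]*)\D*?fans")

def first_group(compiled, text):
    """Return group 1 of the first match, or None when nothing matches."""
    found = compiled.search(text)
    return found.group(1) if found else None

print(first_group(strict, "6 of 99 fans in the whole galaxy of fans"))  # 99
print(first_group(strict, "6 of 99 crazed fans"))                       # 99
print(first_group(strict, "6 of 99 41 fans"))                           # None
print(first_group(strict, "6 of 79,110 fans"))                          # None

raw = first_group(with_commas, "6 of 79,110 fans")
print(raw)                        # 79,110
print(int(raw.replace(",", "")))  # 79110
```

The comma has to be stripped before calling int(), since int("79,110") raises a ValueError.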
To learn more about Python regular expressions, you can read various web pages. For a quick reminder, inside the Python interpreter, do:
>>> import re
>>> help(re)
When you are "scraping" text from a web page, you might sometimes run afoul of HTML markup. In general, regular expressions are not a good tool for disregarding HTML or XML markup; you would probably do better to use Beautiful Soup to parse the HTML and extract the text, followed by a regular expression to grab the text you really wanted.
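As a rough sketch of that two-step idea, the example below strips the tags first and only then applies a regular expression. To keep it dependency-free it uses html.parser from the Python 3 standard library as a stand-in for Beautiful Soup, which is a swap on my part. The page string and the TextCollector class are made up for the illustration.

```python
import re
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Hypothetical markup in which a tag wraps the number we are after.
page = "<div>6 of <b>99</b> fans</div>"

collector = TextCollector()
collector.feed(page)
collector.close()
text = "".join(collector.chunks)

pattern = re.compile(r"6 of\s+(\S+)\s+fans")

print(pattern.search(page).group(1))  # <b>99</b>
print(pattern.search(text).group(1))  # 99
```

Run against the raw markup, the pattern drags the tags along with the number; run against the extracted text, it gets just the number.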
I hope this was interesting and/or educational.