OK, so I'm working on a regular expression to find all the header elements in a page.
I've compiled the regular expression:
regex = re.compile(r'''
<h[0-9]>\s?
(<a[ ]href="[A-Za-z0-9.]*">)?\s?
[A-Za-z0-9.,:'"=/?;\s]*\s?
[A-Za-z0-9.,:'"=/?;\s]?
''', re.X)
When I run this in a Python regex tester, it works wonderfully.
Sample data:
<body>
<h1>Dog </h1>
<h2>Cat </h2>
<h3>Fancy </h3>
<h1>Tall cup of lemons</h1>
<h1><a href="dog.com">Dog thing</a></h1>
</body>
Now, in the REDemo, it works wonderfully.
When I put it in my Python code, however, it only prints <a href="dog.com">
Here's my Python code. I'm not sure if I'm doing something wrong or if something is lost in translation. I appreciate your help.
# Python 2 (urllib2, print statement)
import re
import urllib2

stories = []
response = urllib2.urlopen('http://apricotclub.org/duh.html')
html = response.read().lower()
p = re.compile('<h[0-9]>\\s?(<a href=\"[A-Za-z0-9.]*\">)?\\s?[A-Za-z0-9.,:\'\"=/?;\\s]*\\s?[A-Za-z0-9.,:\'\"=/?;\\s]?')
stories = re.findall(p, html)
for i in stories:
    if len(i) >= 5:
        print i
I should also note that when I take out the (<a href=\"[A-Za-z0-9.]*\">)? group from the regular expression, it works fine for non-link <hN> lines.
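For what it's worth, this symptom matches how re.findall is documented to behave: when the pattern contains exactly one capturing group, findall returns the text of that group rather than the whole match, and an empty string where the optional group did not take part. The sketch below is written for Python 3 and uses a shortened version of the pattern and a two-line sample, both trimmed for illustration; it compares the capturing group with a non-capturing (?:...) group.

```python
import re

# Two lines taken from the sample data above.
html = '<h1>Dog </h1>\n<h1><a href="dog.com">Dog thing</a></h1>\n'

# One capturing group: findall returns only what the group matched.
capturing = re.compile(
    r'''<h[0-9]>\s?(<a[ ]href="[A-Za-z0-9.]*">)?\s?[A-Za-z0-9.,:'"=/?;\s]*''')

# Same pattern with a non-capturing group: findall returns the whole match.
non_capturing = re.compile(
    r'''<h[0-9]>\s?(?:<a[ ]href="[A-Za-z0-9.]*">)?\s?[A-Za-z0-9.,:'"=/?;\s]*''')

print(capturing.findall(html))
print(non_capturing.findall(html))
```

With the capturing version, the plain headers come back as empty strings, which the len(i) >= 5 check then filters out, so only the link tag gets printed.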
This question has been asked in several forms over the last few days, so I'm going to say this very clearly.
Use BeautifulSoup, html5lib or lxml.html. Please.
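As a rough illustration of the parser approach, here is a minimal sketch that pulls the headers out of the sample data. It assumes Python 3 and uses the standard library's html.parser as a dependency-free stand-in for the libraries named above, not one of them; the HeaderCollector class and its field names are made up for this example.

```python
from html.parser import HTMLParser

HEADER_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class HeaderCollector(HTMLParser):
    """Hypothetical helper: collects tag, text and href for each header."""

    def __init__(self):
        super().__init__()
        self.headers = []     # finished headers, in document order
        self._current = None  # header currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._current = {'tag': tag, 'text': '', 'href': None}
        elif tag == 'a' and self._current is not None:
            # Remember the link target of an <a> nested inside a header.
            self._current['href'] = dict(attrs).get('href')

    def handle_data(self, data):
        if self._current is not None:
            self._current['text'] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current['tag']:
            self._current['text'] = self._current['text'].strip()
            self.headers.append(self._current)
            self._current = None


sample = """<body>
<h1>Dog </h1>
<h2>Cat </h2>
<h3>Fancy </h3>
<h1>Tall cup of lemons</h1>
<h1><a href="dog.com">Dog thing</a></h1>
</body>"""

collector = HeaderCollector()
collector.feed(sample)
collector.close()
for header in collector.headers:
    print(header['tag'], header['text'], header['href'])
```

A dedicated library would do the same job in fewer lines and cope better with malformed markup; the point here is only that a parser hands you the header text and the link separately, with no character classes to maintain.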