
Issue with regular expressions in Python

Tags: python, html, regex

OK, so I'm working on a regular expression to find all the header information on a site.

I've compiled the regular expression:

regex = re.compile(r'''
    <h[0-9]>\s?
    (<a[ ]href="[A-Za-z0-9.]*">)?\s?
    [A-Za-z0-9.,:'"=/?;\s]*\s?
    [A-Za-z0-9.,:'"=/?;\s]?
''',  re.X)

When I run this in a Python regex tester, it works wonderfully.

Sample data:

<body>
    <h1>Dog </h1>
    <h2>Cat </h2>
    <h3>Fancy </h3>
    <h1>Tall cup of lemons</h1>
    <h1><a href="dog.com">Dog thing</a></h1>
</body>

Now, in the REDemo, it works wonderfully.

When I put it in my Python code, however, it only prints <a href="dog.com">

Here's my Python code. I'm not sure whether I'm doing something wrong or something is lost in translation. I appreciate your help.

import re
import urllib2

stories = []
response = urllib2.urlopen('http://apricotclub.org/duh.html')
html = response.read().lower()
p = re.compile('<h[0-9]>\\s?(<a href=\"[A-Za-z0-9.]*\">)?\\s?[A-Za-z0-9.,:\'\"=/?;\\s]*\\s?[A-Za-z0-9.,:\'\"=/?;\\s]?')
stories = re.findall(p, html)
for i in stories:
    if len(i) >= 5:
        print i

I should also note that when I take out the (<a href=\"[A-Za-z0-9.]*\">)? from the regular expression, it works fine for non-link <hN> lines.


1 Answer

This question has been asked in several forms over the last few days, so I'm going to say this very clearly.

Q: How do I parse HTML with Regular Expressions?

A: Please Don't.

Use BeautifulSoup, html5lib or lxml.html. Please.
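
As a rough idea of what the parser approach looks like, here is a minimal sketch run against the sample data from the question. To stay dependency-free it uses the standard library's html.parser (Python 3) rather than any of the three libraries named above, so treat it as a stand-in, not as their API. The class name HeadingCollector and the dictionary layout are hypothetical, chosen just for this example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class HeadingCollector(HTMLParser):
    """Collect the tag, text and optional link target of each heading."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # heading being read, or None outside headings

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current = {'tag': tag, 'text': '', 'href': None}
        elif tag == 'a' and self._current is not None:
            # Remember the link target of an anchor nested in the heading.
            self._current['href'] = dict(attrs).get('href')

    def handle_data(self, data):
        if self._current is not None:
            self._current['text'] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current['tag']:
            self._current['text'] = self._current['text'].strip()
            self.headings.append(self._current)
            self._current = None


sample = '''<body>
    <h1>Dog </h1>
    <h2>Cat </h2>
    <h3>Fancy </h3>
    <h1>Tall cup of lemons</h1>
    <h1><a href="dog.com">Dog thing</a></h1>
</body>'''

collector = HeadingCollector()
collector.feed(sample)
for heading in collector.headings:
    print(heading['tag'], heading['text'], heading['href'])
```

The parser tracks element nesting itself, so the link inside the last heading needs no special case in a pattern.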

Jerub answered Jul 01 '26 05:07