Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find following tag with pyparsing

I'm using pyparsing to parse HTML. I'm grabbing all embed tags, but in some cases there's an a tag directly following that I also want to grab if it's available.

example:

import pyparsing
target = pyparsing.makeHTMLTags("embed")[0]
target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
target.ignore(pyparsing.htmlComment)

result = target.searchString(""".....
   <object....><embed>.....</embed></object><br /><a href="blah">blah</a>
   """)

I haven't been able to find any character offset in the result objects, otherwise I could just grab a slice of the original input string and work from there.

EDIT:

Someone asked why I don't use BeautifulSoup. That's a good question, let me show you why I chose not to use it with a code sample:

import BeautifulSoup
import urllib
import re
import socket

socket.setdefaulttimeout(3)

# get some random blogs
xml = urllib.urlopen('http://rpc.weblogs.com/shortChanges.xml').read()

success, failure = 0.0, 0.0

for url in re.compile(r'\burl="([^"]+)"').findall(xml)[:30]:
    print url
    try:
        BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
    except IOError:
        pass
    except Exception, e:
        print e
        failure += 1
    else:
        success += 1


print failure / (failure + success)

When I try this, BeautifulSoup fails with parse errors 20-30% of the time. These aren't rare edge cases. pyparsing is slow and cumbersome but it hasn't blown up no matter what I throw at it. If I can be enlightened as to a better way to use BeautifulSoup then I would be really interested in knowing that.

like image 523
ʞɔıu Avatar asked Feb 28 '23 01:02

ʞɔıu


2 Answers

If there is an optional <a> tag that would be interesting if it follows an <embed> tag, then add it to your search pattern:

embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]
target = embedTag + pyparsing.Optional(aTag)
result = target.searchString(""".....   
    <object....><embed>.....</embed></object><br /><a href="blah">blah</a>
    """)

print result.dump()

If you want to capture the character location of an expression within your parser, insert one of these, with a results name:

loc = pyparsing.Empty().setParseAction(lambda s,locn,toks: locn)
target = loc("beforeEmbed") + embedTag + loc("afterEmbed") + 
                                                 pyparsing.Optional(aTag)
like image 95
PaulMcG Avatar answered Mar 07 '23 10:03

PaulMcG


Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any job HTMLParser can't.

like image 28
Ned Batchelder Avatar answered Mar 07 '23 09:03

Ned Batchelder