Find following tag with pyparsing

Question

I'm using pyparsing to parse HTML. I'm grabbing all embed tags, but in some cases there's an a tag directly following that I also want to grab if it's available.

example:

import pyparsing
target = pyparsing.makeHTMLTags("embed")[0]
target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
target.ignore(pyparsing.htmlComment)

result = target.searchString(""".....
   <object....><embed>.....</embed></object><br /><a href="blah">blah</a>
   """)

I haven't been able to find any character offset in the result objects, otherwise I could just grab a slice of the original input string and work from there.

EDIT:

Someone asked why I don't use BeautifulSoup. That's a good question, let me show you why I chose not to use it with a code sample:

import BeautifulSoup
import urllib
import re
import socket

socket.setdefaulttimeout(3)

# get some random blogs
xml = urllib.urlopen('http://rpc.weblogs.com/shortChanges.xml').read()

success, failure = 0.0, 0.0

for url in re.compile(r'\burl="([^"]+)"').findall(xml)[:30]:
    print url
    try:
        BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
    except IOError:
        pass
    except Exception, e:
        print e
        failure += 1
    else:
        success += 1


print failure / (failure + success)

When I try this, BeautifulSoup fails with parse errors 20-30% of the time. These aren't rare edge cases. pyparsing is slow and cumbersome but it hasn't blown up no matter what I throw at it. If I can be enlightened as to a better way to use BeautifulSoup then I would be really interested in knowing that.

PaulMcG · Accepted Answer

If there is an optional <a> tag that would be interesting if it follows an <embed> tag, then add it to your search pattern:

embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]
target = embedTag + pyparsing.Optional(aTag)
result = target.searchString(""".....   
    <object....><embed>.....</embed></object><br /><a href="blah">blah</a>
    """)

print result.dump()

If you want to capture the character location of an expression within your parser, insert one of these, with a results name:

loc = pyparsing.Empty().setParseAction(lambda s,locn,toks: locn)
target = loc("beforeEmbed") + embedTag + loc("afterEmbed") + 
                                                 pyparsing.Optional(aTag)

Ned Batchelder · Answer

Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any job HTMLParser can't.

Find following tag with pyparsing

Tags:

python

html

parsing

pyparsing

ʞɔıu

2 Answers

PaulMcG

Ned Batchelder

Recent Activity

Donate For Us

Find following tag with pyparsing

Tags:

python

html

parsing

pyparsing

ʞɔıu

2 Answers

PaulMcG

Ned Batchelder

Related questions

Recent Activity

Donate For Us