Python HTMLParser - stop parsing

Tags: python, html, dom

I am using Python's HTMLParser from the html.parser module. I am looking for a single tag, and once it is found it would make sense to stop parsing. Is this possible? I tried calling close(), but I am not sure this is the way to go.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        login_form = False
        if tag == "form":
            print("finished")
            # Attempt to stop parsing from inside the handler
            self.close()

However, this seems to recurse, ending with:

  File "/usr/lib/python3.4/re.py", line 282, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
RuntimeError: maximum recursion depth exceeded in comparison

ps-aux asked Nov 01 '22 04:11


1 Answer

According to the docs, the close() method does this:

Force processing of all buffered data as if it were followed by an end-of-file mark.

You're still inside handle_starttag, and the parser hasn't finished working through the buffer yet, so you definitely do not want to force processing of all the buffered data. close() re-enters the parser, which calls your handler again, which calls close() again, and so on; that's why you're getting stuck in recursion. You can't stop the machine from inside the machine.

From the description of reset() this sounds more like what you want:

Reset the instance. Loses all unprocessed data.

but this also can't be called from inside the handlers that the parser itself calls, so it does not help here either.

It sounds like you have two options:

  • raise an exception from the handler (for example StopIteration, or a small custom exception class) and catch it around your call to the parser. Depending on what else you're doing during parsing, the parser object may still retain the information you need. You may need to check that files aren't left open.

  • use a simple flag (True / False) to record whether you have aborted. At the very start of handle_starttag, return immediately if the flag is set. The machinery will still go through all the tags of the HTML but do nothing for each one. If you're implementing handle_endtag as well, it has to check the flag too. You can reset the flag either when you receive an <html> tag or by overriding the feed method.
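
A minimal sketch of both options, to make the difference concrete. The names StopParsing, RaisingParser, FlagParser, the two helper functions and the SAMPLE string are all made up for illustration; only HTMLParser, feed(), close() and handle_starttag() come from the standard library. A custom exception is used here instead of StopIteration so that it cannot be confused with an iterator running out.

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Hypothetical exception, used only to abort parsing early."""


class RaisingParser(HTMLParser):
    """Option 1: raise from the handler and catch around feed()."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.found = True
            raise StopParsing  # unwinds out of feed() right away


class FlagParser(HTMLParser):
    """Option 2: keep parsing, but ignore every tag after the first hit."""

    def __init__(self):
        super().__init__()
        self.aborted = False
        self.seen = []  # tags handled before the abort, for illustration

    def handle_starttag(self, tag, attrs):
        if self.aborted:
            return  # the parser still walks the input, we just do nothing
        self.seen.append(tag)
        if tag == "form":
            self.aborted = True


def find_form_by_exception(html):
    parser = RaisingParser()
    try:
        parser.feed(html)
        parser.close()
    except StopParsing:
        pass  # expected: the handler asked us to stop
    return parser.found


def find_form_by_flag(html):
    parser = FlagParser()
    parser.feed(html)
    parser.close()
    return parser.aborted


# Made-up input for the demonstration
SAMPLE = (
    "<html><body><p>hi</p>"
    "<form action='/login'><input name='u'></form>"
    "</body></html>"
)

if __name__ == "__main__":
    print(find_form_by_exception(SAMPLE))
    print(find_form_by_flag(SAMPLE))
```

With the exception variant the rest of the input is never looked at, but the parser object is left mid-document, so it is probably safest to call reset() on it (from outside the handlers) or to create a new instance before parsing anything else. The flag variant leaves the parser in a normal state at the cost of scanning the whole input.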

Constance answered Nov 12 '22 17:11