I am using Python's HTMLParser from the html.parser module.
I am looking for a single tag, and once it is found it would make sense to stop parsing. Is this possible? I tried calling close(), but I am not sure if this is the way to go.
    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            login_form = False
            if tag == "form":
                print("finished")
                self.close()  # attempt to stop parsing here
However, this seems to recurse without end, finishing with:
    File "/usr/lib/python3.4/re.py", line 282, in _compile
      p, loc = _cache[type(pattern), pattern, flags]
    RuntimeError: maximum recursion depth exceeded in comparison
According to the docs, the close() method does this:
"Force processing of all buffered data as if it were followed by an end-of-file mark."
You're still inside handle_starttag and the parser hasn't finished working through the buffer yet, so you definitely do not want to force processing of all the buffered data. That processing reaches the same tag again, calls your handler again, which calls close() again, and so on: that's why you're getting stuck in recursion. You can't stop the machine from inside the machine.
From its description, reset() sounds more like what you want:
"Reset the instance. Loses all unprocessed data."
But this can't safely be called from inside a handler either, because the parsing loop that invoked the handler is still running, so it does not give you a clean stop.
It sounds like you have two options:
1. Raise an exception (for example a StopIteration) and catch it around your call to the parser. Depending on what else you're doing during parsing, this may retain the information you need. You may need to check that files aren't left open.
2. Use a simple flag (True / False) to signify whether you have aborted or not. At the very start of handle_starttag, just return if aborted. The machinery will still go through all the tags of the HTML, but do nothing for each one. Obviously, if you're implementing handle_endtag as well, then it needs to check the flag too. You can set the flag back to normal either when you receive an <html> tag or by overriding the feed method.
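Both options could be sketched roughly as follows. This is only an illustration, not the one right way to do it: the names StopParsing, RaisingParser, FlagParser, contains_form and the seen list are made up for the example, and a dedicated exception class is used in place of StopIteration so that nothing else can catch it by accident. The flag version assumes the whole document is passed in a single feed() call.

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Hypothetical exception, used only to unwind out of feed()."""


class RaisingParser(HTMLParser):
    """Option 1: abort by raising from inside the handler."""

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            raise StopParsing


def contains_form(html):
    """Return True as soon as a <form> start tag is seen."""
    parser = RaisingParser()
    try:
        parser.feed(html)
    except StopParsing:
        return True
    return False


class FlagParser(HTMLParser):
    """Option 2: keep parsing, but ignore every tag after the first <form>."""

    def __init__(self):
        super().__init__()
        self.aborted = False
        self.seen = []  # start tags handled before aborting

    def feed(self, data):
        # Clear the flag on each feed() so the instance can be reused.
        # Assumes one feed() call per document, not chunked input.
        self.aborted = False
        super().feed(data)

    def handle_starttag(self, tag, attrs):
        if self.aborted:
            return  # still called for every tag, but does nothing
        self.seen.append(tag)
        if tag == "form":
            self.aborted = True
```

With the exception approach the parser is interrupted partway through its buffer, so it is probably safest to create a fresh instance (or call reset() from outside the handler) before parsing another document. With the flag approach the rest of the document is still scanned, which costs time on large pages but leaves the parser in a normal state.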