I wrote a python script that processes a large amount of downloaded webpages HTML(120K pages). I need to parse them and extract some information from there. I tried using BeautifulSoup, which is easy and intuitive, but it seems to run super slowly. As this is something that will have to run routinely on a weak machine (on amazon) speed is important. is there an HTML/XML parser in python that will work much faster than BeautifulSoup? or must I resort to regex parsing..
The HTML parser is a structured markup processing tool. It defines a class called HTMLParser, which is used to parse HTML files. It comes in handy for web crawling.
Parsing name and text attributes of tagsUsing the name attribute of the tag to print its name and the text attribute to print its text along with the code of the tag- ul from the file.
html5lib: A pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.
lxml is a fast xml and html parser: http://lxml.de/parsing.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With