A fast Python HTML parser [closed]

I wrote a Python script that processes a large number of downloaded web pages (120K HTML pages). I need to parse them and extract some information. I tried BeautifulSoup, which is easy and intuitive, but it runs very slowly. Since this has to run routinely on a weak machine (on Amazon), speed is important. Is there an HTML/XML parser in Python that works much faster than BeautifulSoup, or must I resort to regex parsing?

WeaselFox asked Mar 12 '12 16:03



1 Answer

lxml is a fast XML and HTML parser: http://lxml.de/parsing.html
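
To give an idea of what that looks like in practice, here is a minimal sketch, assuming lxml is installed (`pip install lxml`) and that the goal is to pull the page title and link targets out of each page. The function name `extract_title_and_links` and the sample markup are made up for illustration; the question does not say which fields are needed, so adjust the XPath expressions to suit.

```python
from lxml import html  # third-party package: pip install lxml


def extract_title_and_links(page_source):
    """Parse one HTML page and return (title, list of href values).

    Hypothetical helper for illustration; the fields extracted here
    are an assumption, not something stated in the question.
    """
    # lxml.html tolerates sloppy markup such as unclosed tags
    tree = html.fromstring(page_source)
    # findtext returns None when the page has no <title> element
    title = tree.findtext(".//title")
    # Selecting an attribute with XPath yields a list of strings
    links = tree.xpath("//a/@href")
    return title, links


# Made-up sample page with a deliberately unclosed <p> tag
sample = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/first">First</a>
    <p>Unclosed paragraph
    <a href="/second">Second</a>
  </body>
</html>
"""

title, links = extract_title_and_links(sample)
print(title)
print(links)
```

For 120K files you would call the helper once per page inside your existing loop. If you would rather keep the BeautifulSoup API, it can also be told to use lxml as its underlying parser with `BeautifulSoup(page_source, "lxml")`, though how much that helps on your data is something you would need to measure.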

Marcin answered Sep 28 '22 09:09