A fast python HTML parser [closed]

Tags: python, html, xml, beautifulsoup

I wrote a Python script that processes a large number of downloaded web pages (120K HTML pages). I need to parse them and extract some information. I tried BeautifulSoup, which is easy and intuitive, but it runs very slowly. Since this will have to run routinely on a weak machine (on Amazon), speed is important. Is there an HTML/XML parser in Python that works much faster than BeautifulSoup, or must I resort to regex parsing?

asked Mar 12 '12 16:03

WeaselFox

1 Answer

lxml is a fast XML and HTML parser: http://lxml.de/parsing.html
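As a rough illustration of what parsing with lxml might look like, here is a minimal sketch. It assumes lxml is installed (`pip install lxml`); the sample markup and the `extract_info` helper are made up for this example and are not from the question, so the XPath expressions would need adapting to whatever you actually want to extract.

```python
# Minimal sketch: pull the title and link targets out of one HTML page.
# Assumes lxml is installed; SAMPLE_HTML and extract_info are hypothetical.
import lxml.html

SAMPLE_HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/first">First</a>
    <a href="/second">Second</a>
    <p>No link here
  </body>
</html>
"""


def extract_info(html_text):
    """Return the page title and every href found in html_text."""
    # fromstring() tolerates sloppy markup such as the unclosed <p> above.
    tree = lxml.html.fromstring(html_text)
    title = tree.findtext(".//title")
    # XPath attribute queries return string-like results; convert to plain str.
    links = [str(href) for href in tree.xpath("//a/@href")]
    return {"title": title, "links": links}


info = extract_info(SAMPLE_HTML)
print(info)
```

For 120K files you would call something like `extract_info` once per page inside a loop. If you want to keep BeautifulSoup's API, `BeautifulSoup(html_text, "lxml")` uses lxml as the underlying parser, which may also help, though how much depends on your pages.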

answered Sep 28 '22 09:09

Marcin
