Best library to parse HTML with Python 3 and example? [closed]

Tags:

python-3.x

I'm completely new to Python and am using Python 3.1 on Windows (pywin). I need to parse some HTML, essentially to extract values between specific HTML tags, and I'm confused by the array of options; everything I find is suited for Python 2.x. I've read raves about Beautiful Soup, HTML5Lib and lxml, but I cannot figure out how to install any of these on Windows.

Questions:

What HTML parser do you recommend?
How do I install it? (Be gentle, I'm completely new to Python and remember I'm on Windows)
Do you have a simple example of how to use the recommended library to snag HTML from a specific URL and return a value out of, say, something like this:

<div class="foo"><table><tr><td>foo</td></tr></table><a class="link" href='/blahblah'>Link</a></div>

(say we want to return "/blahblah")

698

asked Mar 24 '10 02:03

TMC

1 Answer

Web-scraping in Python 3 is currently very poorly supported; all the decent libraries work only with Python 2. If you must web scrape in Python, use Python 2.

Although Beautiful Soup is often recommended (every question regarding web scraping with Python on Stack Overflow suggests it), it's not as good for Python 3 as it is for Python 2; I couldn't even install it, as the installation code was still Python 2.

As for adequate and simple-to-install solutions for Python 3, you can try the standard library's HTML parser (the html.parser module). Although it is quite barebones, it comes with Python 3, so there is nothing extra to install.
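As a rough illustration of what using the built-in parser might look like for the snippet in the question, here is a minimal sketch built on html.parser.HTMLParser. The class name LinkHrefParser and the helper extract_hrefs are made up for this example and are not part of any library; the sample string is just the HTML from the question, inlined so the code runs without network access.

```python
from html.parser import HTMLParser


class LinkHrefParser(HTMLParser):
    """Collect href values of <a> tags that carry a given class.

    Hypothetical helper class for this example, not a library API.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href") is not None:
            self.hrefs.append(attr_map["href"])


def extract_hrefs(html, css_class="link"):
    """Return the href of every <a> tag in html that has css_class."""
    parser = LinkHrefParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.hrefs


# The HTML from the question, inlined as a string.
sample = ('<div class="foo"><table><tr><td>foo</td></tr></table>'
          '<a class="link" href=\'/blahblah\'>Link</a></div>')

print(extract_hrefs(sample))  # prints ['/blahblah']
```

To run this against a live page, the HTML could presumably be fetched with urllib.request.urlopen(url).read() and decoded to a string before being passed to extract_hrefs; the right URL and character encoding depend on the page in question.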

123

answered Oct 04 '22 23:10

Humphrey Bogart
