I'm new to Python completely and am using Python 3.1 on Windows (pywin). I need to parse some HTML, to essentially extra values between specific HTML tags and am confused at my array of options, and everything I find is suited for Python 2.x. I've read raves about Beautiful Soup, HTML5Lib and lxml, but I cannot figure out how to install any of these on Windows.
Questions:
Do you have a simple example on how to use the recommended library to snag HTML from a specific URL and return the value out of say something like this:
<div class="foo"><table><tr><td>foo</td></tr></table><a class="link" href='/blahblah'>Link</a></div>
(say we want to return "/blahblah")
Beautiful Soup (bs4) is a Python library that is used to parse information out of HTML or XML files. It parses its input into an object on which you can run a variety of searches. To start parsing an HTML file, import the Beautiful Soup library and create a Beautiful Soup object as shown in the following code example.
Lxml and Elementree use a mostly compatible api that is more of a standard than Beautiful soup . In my opinion, lxml is the best module for working with xml documents, but the ElementTree included with python is still pretty good.
Web-scraping in Python 3 is currently very poorly supported; all the decent libraries work only with Python 2. If you must web scrape in Python, use Python 2.
Although Beautiful Soup is oft recommended (every question regarding web scraping with Python in Stack Overflow suggests it), it's not as good for Python 3 as it is for Python 2; I couldn't even install it as the installation code was still Python 2.
As for adequate and simple-to-install solutions for Python 3, you can try the library's HTML parser, although quite barebones, it comes with Python 3.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With