 

Best library to parse HTML with Python 3 and example? [closed]

Tags:

python-3.x

I'm completely new to Python and am using Python 3.1 on Windows (pywin). I need to parse some HTML, essentially to extract values between specific HTML tags, and I'm confused by the array of options; everything I find is suited to Python 2.x. I've read raves about Beautiful Soup, html5lib and lxml, but I cannot figure out how to install any of these on Windows.

Questions:

  1. What HTML parser do you recommend?
  2. How do I install it? (Be gentle, I'm completely new to Python and remember I'm on Windows)
  3. Do you have a simple example of how to use the recommended library to fetch the HTML from a specific URL and return a value from, say, something like this:

    <div class="foo"><table><tr><td>foo</td></tr></table><a class="link" href='/blahblah'>Link</a></div>

(say we want to return "/blahblah")

TMC asked Mar 24 '10 02:03


People also ask

Which Python library is used to parse HTML?

Beautiful Soup (bs4) is a Python library for pulling information out of HTML or XML files. It parses its input into an object on which you can run a variety of searches. To start parsing an HTML file, import the library and create a BeautifulSoup object from the file's contents.

What is best HTML parser in Python?

lxml and ElementTree share a mostly compatible API that is more of a standard than Beautiful Soup's. In my opinion, lxml is the best module for working with XML documents, but the ElementTree included with Python is still pretty good.
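As a rough illustration of the ElementTree API mentioned above, here is a minimal sketch that reads the `href` out of the sample markup from the question. It only works because that particular snippet happens to be well-formed XML; real-world HTML often is not, and `ET.fromstring` raises a `ParseError` in that case. The names `SNIPPET` and `first_link_href` are made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question. It is well-formed XML, which is
# the only reason an XML parser accepts it.
SNIPPET = ('<div class="foo"><table><tr><td>foo</td></tr></table>'
           "<a class=\"link\" href='/blahblah'>Link</a></div>")


def first_link_href(markup, css_class="link"):
    """Return the href of the first <a> whose class equals css_class."""
    root = ET.fromstring(markup)
    # iter("a") walks the root element and all of its descendants.
    for anchor in root.iter("a"):
        if anchor.get("class") == css_class:
            return anchor.get("href")
    return None


print(first_link_href(SNIPPET))  # prints /blahblah
```

lxml exposes a largely similar interface, so a sketch like this should carry over with few changes, but that is an assumption to check against the lxml documentation.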


1 Answer

Web-scraping in Python 3 is currently very poorly supported; all the decent libraries work only with Python 2. If you must web scrape in Python, use Python 2.

Although Beautiful Soup is often recommended (nearly every question about web scraping with Python on Stack Overflow suggests it), it's not as good for Python 3 as it is for Python 2; I couldn't even install it, because the installation code was still written for Python 2.

As for adequate and simple-to-install solutions for Python 3, you can try the standard library's HTML parser (the `html.parser` module). It is quite barebones, but it comes with Python 3, so there is nothing to install.
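A minimal sketch of that approach, applied to the sample markup from the question, might look like the following. It uses only `html.parser.HTMLParser` from the standard library; the class name `LinkHrefParser` and the helper `find_hrefs` are invented for this example and are not part of any library.

```python
from html.parser import HTMLParser


class LinkHrefParser(HTMLParser):
    """Collect href values from <a> tags that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value can be None
        # for attributes written without "=...".
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_hrefs(html, css_class):
    """Return every href found on <a class="css_class"> tags in html."""
    parser = LinkHrefParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.hrefs


# The sample markup from the question.
SAMPLE = ('<div class="foo"><table><tr><td>foo</td></tr></table>'
          "<a class=\"link\" href='/blahblah'>Link</a></div>")

print(find_hrefs(SAMPLE, "link"))  # prints ['/blahblah']
```

To run this against a live page, the HTML string could come from `urllib.request.urlopen(url).read()` decoded to text; which encoding to decode with depends on the page, so treat that step as an assumption to verify for your URL.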

Humphrey Bogart answered Oct 04 '22 23:10