Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing HTML using Python

People also ask

How do you parse in HTML?

If you just want to parse HTML and your HTML is intended for the body of your document, you could do the following : (1) var div=document. createElement("DIV"); (2) div. innerHTML = markup; (3) result = div. childNodes; --- This gives you a collection of childnodes and should work not just in IE8 but even in IE6-7.

Can I use HTML with Python?

You are able to run a Python file using HTML using PHP.


So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, Or something similar.

try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = #the HTML code you've written above
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class':'container'}).text)

You don't need performance descriptions I guess - just read how BeautifulSoup works. Look at its official documentation.


I guess what you're looking for is pyquery:

pyquery: a jquery-like library for python.

An example of what you want may be like:

from pyquery import PyQuery    
html = # Your HTML CODE
pq = PyQuery(html)
tag = pq('div#id') # or     tag = pq('div.class')
print tag.text()

And it uses the same selectors as Firefox's or Chrome's inspect element. For example:

the element selector is 'div#mw-head.noprint'

The inspected element selector is 'div#mw-head.noprint'. So in pyquery, you just need to pass this selector:

pq('div#mw-head.noprint')

Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated it still gives you a good overview.

Python HTML parser performance

I'd recommend BeautifulSoup even though it isn't built in. Just because it's so easy to work with for those kinds of tasks. Eg:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.google.com/')
soup = BeautifulSoup(page)

x = soup.body.find('div', attrs={'class' : 'container'}).text

Compared to the other parser libraries lxml is extremely fast:

  • http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/
  • http://www.ianbicking.org/blog/2008/03/python-html-parser-performance.html

And with cssselect it’s quite easy to use for scraping HTML pages too:

from lxml.html import parse
doc = parse('http://www.google.com').getroot()
for div in doc.cssselect('a'):
    print '%s: %s' % (div.text_content(), div.get('href'))

lxml.html Documentation


I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).

In my experience Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser, rather a very good string analyzer.


I recommend using justext library:

https://github.com/miso-belica/jusText

Usage: Python2:

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text

Python3:

import requests
import justext

response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print (paragraph.text)