Parsing HTML using Python

I'm looking for an HTML parser for Python that I can ask to get me the content/text of the div tag with class='container' contained within the body tag, or something similar.

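# Beautiful Soup 4; the old "from BeautifulSoup import BeautifulSoup" form is Python 2 only
from bs4 import BeautifulSoup

# Placeholder markup; substitute the HTML you want to parse.
html = "<html><body><div class='container'>Some text</div></body></html>"
# Name the parser explicitly so results do not depend on what else is installed.
parsed_html = BeautifulSoup(html, 'html.parser')
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)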

I guess you don't need performance comparisons for this. Just read the official BeautifulSoup documentation to see how it works.

I guess what you're looking for is pyquery:

pyquery: a jquery-like library for python.

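An example of what you want might look like this:

from pyquery import PyQuery

# Placeholder markup; substitute the HTML you want to parse.
html = "<div id='id' class='class'>Some text</div>"
pq = PyQuery(html)
tag = pq('div#id')  # or: tag = pq('div.class')
print(tag.text())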

pyquery uses the same selectors that Firefox's or Chrome's Inspect Element tool shows. For example, if the inspected element's selector is 'div#mw-head.noprint', you just need to pass that selector to pyquery:

pq('div#mw-head.noprint')
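
If the selector syntax is unfamiliar, it may help to see how 'div#mw-head.noprint' breaks down: a tag name, an id after '#', and one or more classes after '.'. The sketch below is a hypothetical helper (split_selector is not part of pyquery, and this is not how pyquery resolves selectors internally). It only handles the simple tag#id.class form, using the standard library.

```python
import re


def split_selector(selector):
    """Split a simple 'tag#id.class1.class2' selector into its parts."""
    match = re.fullmatch(r"(?P<tag>[\w-]*)(?P<rest>(?:[#.][\w-]+)*)", selector)
    if match is None:
        raise ValueError("unsupported selector: %r" % selector)
    # Each part is a ('#' or '.', name) pair.
    parts = re.findall(r"([#.])([\w-]+)", match.group("rest"))
    ids = [name for kind, name in parts if kind == "#"]
    classes = [name for kind, name in parts if kind == "."]
    return {
        "tag": match.group("tag") or None,
        "id": ids[0] if ids else None,
        "classes": classes,
    }


print(split_selector("div#mw-head.noprint"))
# {'tag': 'div', 'id': 'mw-head', 'classes': ['noprint']}
```

So the selector above matches a div whose id is mw-head and which carries the class noprint.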

Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated it still gives you a good overview.

Python HTML parser performance

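I'd recommend BeautifulSoup even though it isn't built in, simply because it's so easy to work with for these kinds of tasks. For example:

from urllib.request import urlopen  # Python 3; this module was urllib2 in Python 2
from bs4 import BeautifulSoup

page = urlopen('http://www.google.com/')
soup = BeautifulSoup(page, 'html.parser')

# find() returns None when no matching div exists, so check before reading .text
div = soup.body.find('div', attrs={'class': 'container'})
x = div.text if div is not None else None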
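
For comparison, here is roughly what the built-in route looks like. This is a minimal sketch using html.parser from the standard library. The class ContainerText, the helper container_text, and the sample markup are made up for illustration. It assumes reasonably well-formed HTML: an unclosed non-void tag inside the div would throw off the depth counter.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContainerText(HTMLParser):
    """Collect the text inside the first <div class="container">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside the target div
        self.done = False   # set once the target div has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif tag == "div" and "container" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def container_text(html):
    parser = ContainerText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical sample markup, used only for illustration.
sample = ("<html><body><div class='container'>Hello <b>world</b><br>!</div>"
          "<p>skip</p></body></html>")
print(container_text(sample))  # Hello world!
```

That is a lot of bookkeeping for one div, which is the main reason the one-line find() call in BeautifulSoup is attractive.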

Compared to the other parser libraries lxml is extremely fast:

http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/
http://www.ianbicking.org/blog/2008/03/python-html-parser-performance.html

And with cssselect it’s quite easy to use for scraping HTML pages too:

from lxml.html import parse  # cssselect() also needs the separate cssselect package

doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):
    print('%s: %s' % (link.text_content(), link.get('href')))

lxml.html Documentation
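
If installing lxml and cssselect is not an option, a similar link listing can be approximated with the standard library. The sketch below is not lxml's API. LinkCollector, extract_links, and the sample markup are hypothetical names used for illustration, and it does not handle nested or unclosed <a> tags.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (text, href) pairs
        self._href = None      # href of the <a> currently open
        self._inside = False   # True between <a> and </a>
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append(("".join(self._text).strip(), self._href))
            self._inside = False


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical sample markup, used only for illustration.
sample = '<p><a href="/about">About <b>us</b></a> and <a href="/help">Help</a></p>'
for text, href in extract_links(sample):
    print("%s: %s" % (text, href))
```

Like text_content() in the lxml version, the collected text includes the text of child elements such as the <b> above.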

I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).

In my experience Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser, rather a very good string analyzer.

I recommend using the jusText library:

https://github.com/miso-belica/jusText

Usage with Python 2:

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text

Usage with Python 3:

import requests
import justext

response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print(paragraph.text)
