
Using Beautiful Soup to find specific class

I am trying to use Beautiful Soup to scrape housing price data from Zillow.

I get the web page by property id, eg. http://www.zillow.com/homes/for_sale/18429834_zpid/

When I try the find_all() function, I do not get any results:

results = soup.find_all('div', attrs={"class":"home-summary-row"})

However, if I take the HTML and cut it down to just the bits I want, eg.:

<html>
    <body>
        <div class=" status-icon-row for-sale-row home-summary-row">
        </div>
        <div class=" home-summary-row">
            <span class=""> $1,342,144 </span>
        </div>
    </body>
</html>

I get 2 results, both <div>s with the class home-summary-row. So, my question is, why do I not get any results when searching the full page?


Working example:

from bs4 import BeautifulSoup
import requests

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
response = requests.get(url)
html = response.content
#html = '<html><body><div class=" status-icon-row for-sale-row home-summary-row"></div><div class=" home-summary-row"><span class=""> $1,342,144 </span></div></body></html>'
soup = BeautifulSoup(html, "html5lib")

results = soup.find_all('div', attrs={"class":"home-summary-row"})
print(results)
asked Jan 17 '17 by SFBA26


3 Answers

Your HTML is not well-formed, and in cases like this choosing the right parser is crucial. BeautifulSoup currently supports three HTML parsers, each of which handles broken HTML differently:

  • html.parser (built-in, no additional modules needed)
  • lxml (the fastest, requires lxml to be installed)
  • html5lib (the most lenient, requires html5lib to be installed)
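As a small illustration of why the choice matters, here is a contrived broken snippet (not the Zillow page) parsed with the built-in html.parser, which runs without any extra installs. html.parser simply keeps unclosed tags open, so both <div>s survive in the tree; a stricter or more lenient parser may restructure the same markup differently:

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: neither <div> is ever closed
broken = '<body><div class="home-summary-row"><span>$1,342,144</span><div class="other">'

soup = BeautifulSoup(broken, "html.parser")
# html.parser leaves the unclosed tags in place, nested as encountered
print(len(soup.find_all("div")))  # prints 2
```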

The Differences between parsers documentation page describes the differences in more detail. In your case, to demonstrate the difference:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> 
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>> 
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3

As you can see, in your case, both html.parser and lxml do the job, but html5lib does not.
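It is also worth noting why the multi-class <div> matched at all: BeautifulSoup treats class as a multi-valued attribute, so searching for a single class name matches any element whose class list contains it. A minimal sketch, using a one-line snippet modeled on the question's markup:

```python
from bs4 import BeautifulSoup

# class is a multi-valued attribute in bs4: it is split on whitespace
html = '<div class=" status-icon-row for-sale-row home-summary-row"></div>'
soup = BeautifulSoup(html, "html.parser")

# Matching any one of the element's classes succeeds
print(len(soup.find_all("div", attrs={"class": "home-summary-row"})))  # prints 1
```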

answered Sep 30 '22 by alecxe


import requests
from bs4 import BeautifulSoup

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

g_data = soup.find_all("div", {"class": "home-summary-row"})

print(g_data[1].text)

# for item in g_data:
#     print(item("span")[0].text)
#     print('\n')

I got this working too, but it looks like someone beat me to it. Going to post anyway.

answered Sep 30 '22 by RobBenz


According to the W3.org validator, the page's HTML has a number of issues, such as stray closing tags and tags split across multiple lines. For example:

<a 
href="http://www.zillow.com/danville-ca-94526/sold/"  title="Recent home sales" class=""  data-za-action="Recent Home Sales"  >

This kind of markup can make it much more difficult for BeautifulSoup to parse the HTML.

You may want to try running something to clean up the HTML, such as removing the line breaks and trailing spaces from the end of each line. BeautifulSoup can also clean up the HTML tree for you:

from bs4 import BeautifulSoup  # the old "from BeautifulSoup import ..." form is bs3
tree = BeautifulSoup(bad_html, "html.parser")  # name the parser explicitly
good_html = tree.prettify()
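For instance, feeding prettify() a snippet with an unclosed <div> (a made-up example in the spirit of the question's markup, not the real Zillow page) yields output in which every tag is properly closed and indented, since bs4 serializes its repaired tree rather than the original text:

```python
from bs4 import BeautifulSoup

# Unclosed <div>: </body> arrives before </div>
bad_html = '<html><body><div class="home-summary-row"><span> $1,342,144 </span></body></html>'

tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()
print(good_html)  # the output contains a matching </div>
```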
answered Sep 30 '22 by Soviut