 

Can't scrape some elements off of zillow website

I am trying to scrape content from the Zillow website.

Example: https://www.zillow.com/homedetails/689-Luis-Munoz-Marin-Blvd-APT-508-Jersey-City-NJ-07310/108625724_zpid/

The problem is that I can't scrape the contents of the price and tax history. I assumed they are elements that JavaScript adds after the page loads, so I tried using Selenium, but I still can't get them. The following is what I tried.

Code

phistory = soup.find("div", {"id": "hdp-price-history"})
print(phistory)

HTML

<div class="loading yui3-widget yui3-async-block yui3-complaintstable yui3-hdppricehistory yui3-hdppricehistory-content" id="hdp-price-history">
  <div class="zsg-content-section zsg-loading-spinner_lg"></div>
</div>

This is the outermost element, but it contains only a loading spinner and none of the history rows. I also tried soup.find_all("table", class_="zsg-table yui3-toggle-content-minimized"), which returns nothing.
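One way to confirm that the parsed markup really is just the placeholder is to count the <table> tags nested under the container. The following is a minimal sketch using only the standard library; the helper names (TableCounter, count_tables) and the second sample string are hypothetical and only illustrate the check, and the first sample is a shortened copy of the markup shown above.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TableCounter(HTMLParser):
    """Counts <table> tags nested inside the element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0    # greater than 0 while inside the target element
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "table":
                self.tables += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.element_id and tag not in VOID_TAGS:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def count_tables(html, element_id):
    parser = TableCounter(element_id)
    parser.feed(html)
    parser.close()
    return parser.tables


# Shortened copy of the placeholder markup from the question.
placeholder = (
    '<div class="loading yui3-widget" id="hdp-price-history">'
    '<div class="zsg-content-section zsg-loading-spinner_lg"></div>'
    '</div>'
)
# Hypothetical markup once the table has been rendered.
rendered = (
    '<div id="hdp-price-history">'
    '<table><tr><td>Sold</td></tr></table>'
    '</div>'
)

print(count_tables(placeholder, "hdp-price-history"))  # 0
print(count_tables(rendered, "hdp-price-history"))     # 1
```

A count of 0 suggests the page source was captured before the script filled in the container, which is consistent with the spinner markup above.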

Karan Singh asked May 11 '17 03:05





1 Answer

You can wait until the required <table> has been generated and is visible:

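from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC

# Assumption: Chrome is used here only as an example; any installed WebDriver should work
driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homedetails/689-Luis-Munoz-Marin-Blvd-APT-508-Jersey-City-NJ-07310/108625724_zpid/")
# Wait up to 10 seconds for the price history table to be rendered and visible
table = wait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//div[@id="hdp-price-history"]//table')))
print(table.text)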

Output:

DATE EVENT PRICE $/SQFT SOURCE
05/03/17 Listed for sale $750,000+159% $534 KELLER WILLIAM...
06/15/11 Sold $290,000-38.3% $206 Public Record
10/14/05 Sold $470,000 $334 Public Record

You can also extract individual values directly with Selenium, without using BeautifulSoup, e.g.

print(table.find_element(By.XPATH, './/td[text()="Listed for sale"]/following::span').text)

Output:

$750,000

or

print(table.find_element(By.XPATH, './/td[text()="Sold"]/following::span').text)

Output:

$290,000
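If every row is needed rather than a single price, the text returned by table.text can be split into fields. The following is a rough sketch that assumes each row has the same layout as the output printed earlier in this answer (date, event, price with an optional percentage change, price per square foot, source). The names ROW and parse_price_history are hypothetical, and rows with a different layout are simply skipped.

```python
import re

# Assumed row layout, taken from the printed output in this answer:
# DATE EVENT $PRICE[+/-CHANGE%] $PER_SQFT SOURCE
ROW = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) "
    r"(?P<event>.+?) "
    r"\$(?P<price>[\d,]+)(?P<change>[+-][\d.]+%)? "
    r"\$(?P<per_sqft>[\d,]+) "
    r"(?P<source>.*)$"
)


def parse_price_history(text):
    """Turn the text of the price history table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header line or a row with an unexpected layout
        row = match.groupdict()
        # Convert "750,000" style strings to integers.
        row["price"] = int(row["price"].replace(",", ""))
        row["per_sqft"] = int(row["per_sqft"].replace(",", ""))
        rows.append(row)
    return rows


# The output shown above, used here in place of table.text.
sample = """DATE EVENT PRICE $/SQFT SOURCE
05/03/17 Listed for sale $750,000+159% $534 KELLER WILLIAM...
06/15/11 Sold $290,000-38.3% $206 Public Record
10/14/05 Sold $470,000 $334 Public Record"""

for row in parse_price_history(sample):
    print(row["date"], row["event"], row["price"], row["change"])
```

With a live driver, parse_price_history(table.text) would replace the sample string. The change field is None when the row has no percentage, as in the 10/14/05 row.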
Andersson answered Oct 27 '22 12:10
