How to scrape dynamic web pages with Python

[What I'm trying to do]

Scrape the webpage below for used car data.
http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1

[Issue]

I want to scrape every results page. The URL above shows only the first 30 items, which the code below (which I wrote) can scrape. Links to the other pages are displayed as 1 2 3..., but the link addresses seem to be generated by JavaScript. I googled for useful information but couldn't find any.

from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")

soup = BeautifulSoup(html, "lxml")
total_cars = soup.find(class_="change change_01").find('em').string
tmp = soup.find(class_="change change_01").find_all('span')
car_start, car_end = tmp[0].string, tmp[1].string

# get urls to car detail pages
car_urls = []
heading_inners = soup.find_all(class_="heading_inner")
for heading_inner in heading_inners:
    href = heading_inner.find('h4').find('a').get('href')
    car_urls.append('http://www.goo-net.com' + href)

for url in car_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    #title
    print(soup.find(class_='hdBlockTop').find('p', class_='tit').string)
    #price of car itself
    print(soup.find(class_='price1').string)
    #price of car including tax
    print(soup.find(class_='price2').string)

    tds = soup.find(class_='subData').find_all('td')
    # year
    print(tds[0].string)
    # distance
    print(tds[1].string)
    # displacement
    print(tds[2].string)
    # inspection
    print(tds[3].string)

[What I'd like to know]

How can I scrape all of the results pages? I'd prefer to use BeautifulSoup4 (Python), but if that is not the appropriate tool, please suggest others.

[My environment]

  • Windows 8.1
  • Python 3.5
  • PyDev (Eclipse)
  • BeautifulSoup4

Any guidance would be appreciated. Thank you.

asked Nov 19 '15 05:11 by dixhom



2 Answers

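You can use Selenium, as in the sample below:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium drives a real browser, so JavaScript links behave as they do for a user.
driver = webdriver.Firefox()
driver.get('http://example.com')
# Locate by class name here; By.LINK_TEXT, By.CSS_SELECTOR, etc. also work.
# (Selenium 4 removed the older find_element_by_class_name helper.)
element = driver.find_element(By.CLASS_NAME, "yourClassName")
element.click()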
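
For the paging in the question, one possible way to combine this with BeautifulSoup is to let Selenium click through the pager and hand each page's HTML to the parser. The following is only a rough sketch under assumptions I have not verified against the site: that each pager link's visible text is its page number, and that clicking it replaces the pager in the DOM. The names `walk_pages` and `scrape_with_firefox` are made up for illustration; the `.heading_inner h4 a` selector comes from the question's own code.

```python
from urllib.parse import urljoin


def walk_pages(get_html, go_to_page, parse_links, max_pages=100):
    """Collect links from page 1, 2, 3... until no further page can be opened."""
    urls = []
    page = 1
    while True:
        urls.extend(parse_links(get_html()))
        page += 1
        if page > max_pages or not go_to_page(page):
            return urls


def scrape_with_firefox(start_url):
    """Drive Firefox through the pager and return every car detail URL found."""
    # Third-party imports are local so walk_pages() can be tried without a browser.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def parse_links(html):
        # Same markup the question's code relies on: heading_inner, then h4, then a.
        soup = BeautifulSoup(html, "lxml")
        anchors = soup.select(".heading_inner h4 a[href]")
        return [urljoin(start_url, a["href"]) for a in anchors]

    driver = webdriver.Firefox()

    def go_to_page(page):
        # ASSUMPTION: the pager link's visible text is the page number ("2", "3", ...).
        links = driver.find_elements(By.LINK_TEXT, str(page))
        if not links:
            return False
        links[0].click()
        # ASSUMPTION: the click replaces the pager, so the old link goes stale.
        WebDriverWait(driver, 10).until(EC.staleness_of(links[0]))
        return True

    try:
        driver.get(start_url)
        return walk_pages(lambda: driver.page_source, go_to_page, parse_links)
    finally:
        driver.quit()
```

Calling `scrape_with_firefox()` with the search URL from the question should give a list that can stand in for `car_urls`; the detail pages can then still be fetched with `urllib` and BeautifulSoup as before. If the site updates the results in place without replacing the pager, the staleness wait would time out and a different wait condition would be needed.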
answered Oct 19 '22 17:10 by ahmad valipour


The Python module splinter may be a good starting point. It drives an external browser (such as Firefox) and accesses the browser's DOM rather than dealing with the raw HTML only.
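
As a rough illustration of reading from the DOM, the sketch below collects the detail-page links on each results page and then clicks the next page number. It assumes a recent splinter release (one that provides `browser.links.find_by_text`), that the pager links are labelled with their page numbers, and that the `.heading_inner h4 a` structure from the question holds on every page. `collect_car_urls` and `scrape_with_splinter` are hypothetical names, and none of this has been checked against the actual site.

```python
def collect_car_urls(browser, max_pages=100):
    """Read car detail links from the live DOM, page by page."""
    urls = []
    for page in range(1, max_pages + 1):
        # Wait (up to 10 s) for at least one result block to be present.
        # This may need a stricter condition if old results linger after a click.
        browser.is_element_present_by_css(".heading_inner", wait_time=10)
        urls.extend(a["href"] for a in browser.find_by_css(".heading_inner h4 a"))
        if page == max_pages:
            break
        # ASSUMPTION: the pager shows the next page as a link whose text is its number.
        next_links = browser.links.find_by_text(str(page + 1))
        if next_links.is_empty():
            break
        next_links.first.click()
    return urls


def scrape_with_splinter(start_url):
    """Open Firefox through splinter and return every car detail URL found."""
    # Local import so collect_car_urls() can be tried without splinter installed.
    from splinter import Browser

    with Browser("firefox") as browser:
        browser.visit(start_url)
        return collect_car_urls(browser)
```

Because the links are read from the rendered DOM, no separate HTML parser is needed for the listing pages; BeautifulSoup can still be used on the detail pages if you prefer it.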

answered Oct 19 '22 16:10 by ChrisGuest