Scrape the webpage below for used car data.
http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1
I want to scrape all of the result pages. At the URL above, only the first 30 items are shown. I can scrape those with the code below, which I wrote. Links to the other pages are displayed like 1 2 3..., but the link addresses seem to be in JavaScript. I googled for useful information but couldn't find any.
from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")
soup = BeautifulSoup(html, "lxml")
total_cars = soup.find(class_="change change_01").find('em').string
tmp = soup.find(class_="change change_01").find_all('span')
car_start, car_end = tmp[0].string, tmp[1].string

# get urls to car detail pages
car_urls = []
heading_inners = soup.find_all(class_="heading_inner")
for heading_inner in heading_inners:
    href = heading_inner.find('h4').find('a').get('href')
    car_urls.append('http://www.goo-net.com' + href)

for url in car_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    # title
    print(soup.find(class_='hdBlockTop').find('p', class_='tit').string)
    # price of car itself
    print(soup.find(class_='price1').string)
    # price of car including tax
    print(soup.find(class_='price2').string)
    tds = soup.find(class_='subData').find_all('td')
    # year
    print(tds[0].string)
    # distance
    print(tds[1].string)
    # displacement
    print(tds[2].string)
    # inspection
    print(tds[3].string)
How can I scrape all of the pages? I would prefer to use BeautifulSoup4 (Python), but if that is not the appropriate tool, please show me other ones.
Any guidance would be appreciated. Thank you.
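The counters the question's code already reads (total_cars, car_start, car_end) should be enough to estimate how many result pages there are. The snippet below is only a sketch of that arithmetic: the helper names to_int and page_count are made up for illustration, and it assumes the counters are strings that may contain thousands separators and that the first page is a full one.

```python
import re


def to_int(text):
    """Pull the digits out of a counter string such as '12,345'."""
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


def page_count(total_cars, car_start, car_end):
    """Estimate the number of result pages from the on-page counters.

    Assumes car_start and car_end describe a full page (e.g. 1 and 30).
    """
    total = to_int(total_cars)
    per_page = to_int(car_end) - to_int(car_start) + 1
    if per_page <= 0:
        raise ValueError("car_end must not be smaller than car_start")
    # Integer ceiling division: a partly filled last page still counts.
    return -(-total // per_page)


if __name__ == "__main__":
    # Hypothetical counter values, not taken from the live site.
    print(page_count("12,345", "1", "30"))
```

Knowing the page count gives a natural upper bound for whatever loop ends up walking through the numbered links.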
Example: first, provide the URL that we want to open in the web browser, which is now controlled by our Python script. Then we can use the ID of the search toolbox to select that element.
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
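To make "navigating and searching the parse tree" concrete, here is a small sketch that pulls detail-page links out of a listing. The HTML fragment is invented; it only mirrors the heading_inner / h4 / a structure used in the question, and the paths in it are made up. The function name extract_detail_urls is also just for illustration. It uses the built-in "html.parser" so no extra parser needs to be installed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.goo-net.com"

# Invented sample markup; the real page will contain much more.
SAMPLE_HTML = """
<div class="heading_inner"><h4><a href="/usedcar/spread/1.html">Car A</a></h4></div>
<div class="heading_inner"><h4><a href="/usedcar/spread/2.html">Car B</a></h4></div>
<div class="heading_inner"><h4>No link here</h4></div>
"""


def extract_detail_urls(html, base_url=BASE_URL):
    """Return absolute URLs for every detail link found in a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # The CSS selector skips headings that have no anchor, so no None checks are needed.
    for anchor in soup.select(".heading_inner h4 a[href]"):
        # urljoin copes with both relative and already-absolute hrefs.
        urls.append(urljoin(base_url, anchor["href"]))
    return urls


if __name__ == "__main__":
    for url in extract_detail_urls(SAMPLE_HTML):
        print(url)
```

Keeping the parsing in a function that takes an HTML string means the same code works whether the HTML came from urllib or from a browser's page source.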
You can use Selenium, as in the sample below:
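from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://example.com')
# the find_element_by_* helpers were removed in newer Selenium 4 releases; use find_element with a By locator
element = driver.find_element(By.CLASS_NAME, "yourClassName")  # or By.LINK_TEXT, By.ID, etc.
element.click()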
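Extending the single click to a whole result set could look roughly like the sketch below: keep clicking a "next" link and save each page's HTML until the link disappears. The names collect_pages and by_link_text are made up for illustration, the link text you pass in is an assumption you would have to check against the real page, and the fixed pause is a crude stand-in for a proper wait.

```python
import time


def collect_pages(driver, find_next, max_pages=50, pause=1.0):
    """Return the HTML of the current page and of each following page.

    driver    -- a Selenium WebDriver that has already loaded the first page
    find_next -- callable taking the driver and returning the "next page"
                 element, or None when there is no further page
    max_pages -- safety cap so a broken locator cannot loop forever
    pause     -- seconds to sleep after each click (crude wait for the reload)
    """
    pages = [driver.page_source]
    while len(pages) < max_pages:
        link = find_next(driver)
        if link is None:
            break
        link.click()
        time.sleep(pause)
        pages.append(driver.page_source)
    return pages


def by_link_text(link_text):
    """Build a find_next callable that looks for a link with the given text."""
    # Imported lazily so collect_pages can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    def find_next(driver):
        # find_elements returns an empty list instead of raising when nothing matches.
        matches = driver.find_elements(By.LINK_TEXT, link_text)
        return matches[0] if matches else None

    return find_next
```

A call might look like collect_pages(driver, by_link_text("next")), where "next" is a placeholder for whatever text the site really uses. Each returned HTML string can then be handed to BeautifulSoup exactly as in the question.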
The Python module splinter may be a good starting point. It calls an external browser (such as Firefox) and accesses the browser's DOM rather than dealing with HTML only.