I am trying to scrape this website: http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all
However, it requires scrolling down in order to load additional data, and I have no idea how to scroll down using BeautifulSoup or Python. Does anybody here know how?
The code is a bit of a mess but here it is.
import scrapy
from scrapy.selector import Selector
from testtest.items import TesttestItem
import datetime
from selenium import webdriver
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser  # Python 2; on Python 3 this is html.parser
import re
import time

class MLStripper(HTMLParser):
    # Standard tag-stripping helper: collects the text content of any HTML fed to it.
    def __init__(self):
        HTMLParser.__init__(self)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

class MySpider(scrapy.Spider):
    name = "A1Locker"
    allowed_domains = ['www.a1lockerrental.com']
    start_urls = ['http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all']

    def parse(self, response):
        # The unit listings are rendered by JavaScript, so fetch each page
        # with Selenium and hand the rendered HTML to BeautifulSoup.
        url = 'http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=Small'
        driver = webdriver.Firefox()
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        url2 = 'http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=Medium'
        driver2 = webdriver.Firefox()
        driver2.get(url2)
        soup2 = BeautifulSoup(driver2.page_source, 'html.parser')

        items = []
        inside = "Indoor"
        outside = "Outdoor"
        inside_units = ["5 x 5", "5 x 10"]
        outside_units = ["10 x 15", "5 x 15", "8 x 10", "10 x 10",
                         "10 x 20", "10 x 25", "10 x 30"]

        sizeTagz = soup.findAll('span', {"class": "sss-unit-size"})
        sizeTagz2 = soup2.findAll('span', {"class": "sss-unit-size"})
        rateTagz = soup.findAll('p', {"class": "unit-special-offer"})
        specialTagz = soup.findAll('span', {"class": "unit-special-offer"})
        typesTagz = soup.findAll('div', {"class": "unit-info"})
        rateTagz2 = soup2.findAll('p', {"class": "unit-special-offer"})
        specialTagz2 = soup2.findAll('span', {"class": "unit-special-offer"})
        typesTagz2 = soup2.findAll('div', {"class": "unit-info"})

        yield {'date': datetime.datetime.now().strftime("%m-%d-%y"),
               'name': "A1Locker"}

        size = []
        for n in range(len(sizeTagz)):
            print len(rateTagz)
            print len(typesTagz)
            if "Outside" in typesTagz[n].get_text():
                size.append(re.findall(r'\d+', sizeTagz[n].get_text()))
                size.append(re.findall(r'\d+', sizeTagz2[n].get_text()))
                print "logic hit"

        for i in range(len(size)):
            yield {'size': size[i]}

        driver.close()
        driver2.close()
The desired output is to display the data collected from this webpage: http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all
Doing so would require being able to scroll down to view the rest of the data. At least, that is how I imagine it would be done.
Thanks, DM123
There is a WebDriver function that provides this capability; BeautifulSoup doesn't do anything besides parse the HTML you hand it.
Check this out: http://webdriver.io/api/utility/scroll.html
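That link documents the JavaScript (WebdriverIO) bindings; in the Python bindings you used the equivalent is `driver.execute_script`. A minimal sketch (the `scroll_to_bottom` helper and its `pause`/`max_rounds` parameters are mine, not part of Selenium): keep scrolling until `document.body.scrollHeight` stops growing, then read `driver.page_source` as your code already does.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll a Selenium WebDriver to the bottom of the page repeatedly
    until the page height stops growing (i.e. no more content loads)."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the current bottom and give the page time to load more.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared; we are at the real bottom
        last_height = new_height
    return last_height
```

Call it right after `driver.get(url)` and before building the soup, so the rendered source includes the lazily loaded units.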
The website you're trying to scrape loads its content dynamically using JavaScript. Unfortunately, many scraping libraries, such as BeautifulSoup, cannot execute JavaScript on their own. There are a number of options, however, many in the form of headless browsers. A classic one is PhantomJS, but it may be worth taking a look at this great list of options on GitHub, some of which play nicely with BeautifulSoup, such as Selenium.
Keeping Selenium in mind, the answer to this Stack Overflow question may also help.
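For what it's worth, here is part of why a plain HTTP fetch of those URLs can never work: everything after the `#` is a URL fragment, which the browser never sends to the server; the page's JavaScript reads it and renders the matching category client-side, so `category=Small` and `category=Medium` return identical raw HTML. A small sketch (Python 3 standard library) pulling the category out of the fragment:

```python
from urllib.parse import urlparse, parse_qs

url = ("http://www.a1lockerrental.com/self-storage/mo/st-louis/"
       "4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=Small")

# The fragment is everything after '#'; it is stripped before the request is sent.
fragment = urlparse(url).fragment            # "/units?category=Small"

# Parse the fragment itself as a mini-URL to recover its query parameters.
category = parse_qs(urlparse(fragment).query)["category"][0]
print(category)  # -> Small
```

This is why a JavaScript-capable browser (headless or not) is needed: the server only ever sees the URL up to the `#`.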