
How to scrape all content from an infinite scroll website with Scrapy?

I'm using Scrapy.

The website I'm scraping has infinite scroll.

The site has loads of posts, but I only scraped 13.

How do I scrape the rest of the posts?

Here's my code:

import scrapy

class exampleSpider(scrapy.Spider):
    name = "example"
    #from_date = datetime.date.today() - datetime.timedelta(6*365/12)
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/somethinghere/"
    ]

    def parse(self, response):
        for href in response.xpath("//*[@id='page-wrap']/div/div/div/section[2]/div/div/div/div[3]/ul/li/div/h1/a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # scrape contents code here
        pass
asked May 13 '16 by Michimcchicken

1 Answer

Check the website's code.

If the infinite scroll is triggered automatically by a JS action, you could proceed using the Alioth proposal: spynner.

Following the spynner docs, you will find that it can trigger jQuery events.

Look up the library code to see which kinds of events you can fire.

Try to generate a scroll-to-bottom event, or a CSS property change on any of the divs inside the scrollable content of the website. Following the spynner docs, something like:

import spynner

browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
# load your website here, as spynner allows
browser.load_jquery(True)
# run_debug is a debugging wrapper around browser.runjs;
# calling browser.runjs(...) directly also works
ret = run_debug(browser.runjs, 'window.scrollTo(0, document.body.scrollHeight); console.log("scrolling...");')
# continue parsing ret

It is not very likely that an infinite scroll is triggered by an anchor link, but it may be triggered by a jQuery action that is not necessarily attached to a link. In that case, use code like the following:

br = spynner.Browser()
br.load('http://pypi.python.org/pypi')

anchors = br.webframe.findAllElements('#menu ul.level-two a')
# choose the anchor that contains the word "Browse"
anchor = [a for a in anchors if 'Browse' in a.toPlainText()][0]
br.wk_click_element_link(anchor, timeout=10)
output = br.show()
# save output to a file (output.html), or plug these actions into your
# scrapy method and parse the output var as you do with the response body

Then, run Scrapy on the output.html file or, if you implemented it that way, on the in-memory variable you chose to store the modified HTML after the JS action.
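If you go the in-memory route, here is a minimal sketch of handing the rendered HTML back to Scrapy's selectors. It assumes br is the spynner browser from above, that spynner exposes the rendered page through the browser's html attribute, and a hypothetical url variable holding the page address:

from scrapy.http import HtmlResponse

# wrap spynner's JS-rendered HTML so the usual Scrapy selectors work;
# br.html is an assumption based on the spynner docs
rendered = HtmlResponse(url=url, body=br.html, encoding='utf-8')
for href in rendered.xpath("//h1/a/@href").extract():
    print(rendered.urljoin(href))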

As another solution, the website you are trying to parse might have an alternate render version for visitors whose browsers do not have JS enabled.

Try rendering the website with a JavaScript-disabled browser; that way, the site may expose an anchor link at the end of the content section, as in the sketch below.
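If such a fallback link exists, plain Scrapy pagination is enough. A minimal sketch, assuming a hypothetical <a rel="next"> anchor at the end of the post list (the //h1/a/@href XPath is just a shortened form of the selector from the question):

def parse(self, response):
    # extract the post links visible on this page
    for href in response.xpath("//h1/a/@href").extract():
        yield scrapy.Request(response.urljoin(href),
                             callback=self.parse_dir_contents)
    # follow the no-JS pagination link, if the site provides one
    next_page = response.xpath("//a[@rel='next']/@href").extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)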

There are also successful implementations of crawler JS navigation that combine Scrapy with Selenium, as detailed in this SO answer; a sketch of that approach follows.
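A minimal sketch of that Scrapy plus Selenium combination, assuming selenium and a matching chromedriver are installed; the spider scrolls until the page height stops growing, then parses the fully rendered HTML:

import time

import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver

class InfiniteScrollSpider(scrapy.Spider):
    name = "infinite_scroll"
    start_urls = ["http://www.example.com/somethinghere/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        # keep scrolling until the page height stops growing,
        # i.e. the infinite scroll has no more posts to load
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)  # give the XHR that loads more posts time to finish
            new_height = self.driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height
        # wrap the fully rendered page so the usual Scrapy selectors work
        rendered = HtmlResponse(url=response.url,
                                body=self.driver.page_source,
                                encoding='utf-8')
        for href in rendered.xpath("//h1/a/@href").extract():
            yield scrapy.Request(rendered.urljoin(href),
                                 callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        pass  # scrape contents code here

    def closed(self, reason):
        self.driver.quit()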

answered Oct 04 '22 by Evhz