Using Scrapy to find and download PDF files from a website

I've been tasked with pulling PDF files from websites using Scrapy. I'm not new to Python, but Scrapy is very new to me. I've been experimenting with the console and a few rudimentary spiders, and I've found and modified this code:

# Python 2 import; on Python 3 this would be urllib.parse instead.
import urlparse

import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
        # Collect every <a href> on the page and keep only direct .pdf links.
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        # Save the response body under the last segment of the URL.
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)

I run this code at the command line with

scrapy crawl mySpider 

and I get nothing back. I didn't create a Scrapy item because I just want to crawl and download the files, not any metadata. I would appreciate any help on this.
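(A minimal sanity check first, assuming a standard Scrapy project layout: a spider is invoked by its name attribute, which in the code above is pwc_tax rather than mySpider, so the crawl command has to match it:

scrapy list           # prints the names of every spider Scrapy can find
scrapy crawl pwc_tax  # runs the spider whose name attribute is "pwc_tax"

If the names don't match, scrapy crawl fails with a "Spider not found" error rather than silently returning nothing.)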

asked Mar 21 '16 by Murface



1 Answer

The spider logic seems incorrect.

I had a quick look at the website, and it seems there are several types of pages:

  1. The initial page: http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html
  2. Pages for specific articles, e.g. http://www.pwc.com/us/en/tax-services/publications/insights/australia-introduces-new-foreign-resident-cgt-withholding-regime.html, which are reachable from page #1
  3. Actual PDF locations, e.g. http://www.pwc.com/us/en/state-local-tax/newsletters/salt-insights/assets/pwc-wotc-precertification-period-extended-to-june-29.pdf, which are reachable from page #2

The correct logic is therefore: fetch page #1 first, follow its links to the #2 article pages, and download the #3 PDFs from there.
However, your spider tries to extract links to the #3 PDFs directly from page #1, which is why it finds nothing.

EDITED:

I have updated your code, and here's something that actually works:

import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        # Page #1: follow the link to each article page.
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # Page #2: follow only the download links that end in ".pdf".
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Page #3: save the PDF under the last segment of its URL.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
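As a side note, Scrapy also ships with a built-in FilesPipeline that handles this kind of download for you. A rough sketch of that variant, assuming the same selectors as above (the file_urls field name and the FILES_STORE setting are the pipeline's standard conventions; the spider name and download directory here are placeholders):

import scrapy
from scrapy.http import Request


class PwcTaxFilesSpider(scrapy.Spider):
    # Hypothetical variant of the spider above that delegates the actual
    # downloading to Scrapy's stock FilesPipeline.
    name = "pwc_tax_files"
    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    custom_settings = {
        # Enable the built-in pipeline and tell it where to store downloads.
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "downloads",  # placeholder output directory
    }

    def parse(self, response):
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(response.urljoin(href), callback=self.parse_article)

    def parse_article(self, response):
        hrefs = response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract()
        # The pipeline downloads every URL listed under the "file_urls" key.
        yield {"file_urls": [response.urljoin(h) for h in hrefs]}

One trade-off to be aware of: the stock pipeline names each saved file after a hash of its URL (under a full/ subdirectory) rather than keeping the original filename, so the manual save_pdf approach above is the simpler choice if you need human-readable names.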
answered Sep 19 '22 by starrify