Using Scrapy to find and download PDF files from a website

I've been tasked with pulling PDF files from websites using Scrapy. I'm not new to Python, but Scrapy is very new to me. I've been experimenting with the console and a few rudimentary spiders, and I've found and modified this code:

# Python 2 import; on Python 3 this would be urllib.parse instead.
import urlparse

import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
        # Collect every <a href> on the page and keep only direct .pdf links.
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        # Save the response body under the last segment of the URL.
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)

I run this code at the command line with

scrapy crawl mySpider 

and I get nothing back. I didn't create a Scrapy item because I just want to crawl and download the files, not any metadata. I would appreciate any help on this.
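(A minimal sanity check first, assuming a standard Scrapy project layout: a spider is invoked by its name attribute, which in the code above is pwc_tax rather than mySpider, so the crawl command has to match it:

scrapy list           # prints the names of every spider Scrapy can find
scrapy crawl pwc_tax  # runs the spider whose name attribute is "pwc_tax"

If the names don't match, scrapy crawl fails with a "Spider not found" error rather than silently returning nothing.)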

asked Mar 21 '16 by Murface



1 Answer

The spider logic seems incorrect.

I had a quick look at the website, and it seems there are several types of pages:

  1. The initial page: http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html
  2. Pages for specific articles, e.g. http://www.pwc.com/us/en/tax-services/publications/insights/australia-introduces-new-foreign-resident-cgt-withholding-regime.html, which are reachable from page #1
  3. Actual PDF locations, e.g. http://www.pwc.com/us/en/state-local-tax/newsletters/salt-insights/assets/pwc-wotc-precertification-period-extended-to-june-29.pdf, which are reachable from page #2

The correct logic is therefore: fetch page #1 first, follow its links to the #2 article pages, and download the #3 PDFs from there.
However, your spider tries to extract links to the #3 PDFs directly from page #1, which is why it finds nothing.

EDITED:

I have updated your code, and here's something that actually works:

import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        # Page #1: follow the link to each article page.
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # Page #2: follow only the download links that end in ".pdf".
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Page #3: save the PDF under the last segment of its URL.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
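As a side note, Scrapy also ships with a built-in FilesPipeline that handles this kind of download for you. A rough sketch of that variant, assuming the same selectors as above (the file_urls field name and the FILES_STORE setting are the pipeline's standard conventions; the spider name and download directory here are placeholders):

import scrapy
from scrapy.http import Request


class PwcTaxFilesSpider(scrapy.Spider):
    # Hypothetical variant of the spider above that delegates the actual
    # downloading to Scrapy's stock FilesPipeline.
    name = "pwc_tax_files"
    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    custom_settings = {
        # Enable the built-in pipeline and tell it where to store downloads.
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "downloads",  # placeholder output directory
    }

    def parse(self, response):
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(response.urljoin(href), callback=self.parse_article)

    def parse_article(self, response):
        hrefs = response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract()
        # The pipeline downloads every URL listed under the "file_urls" key.
        yield {"file_urls": [response.urljoin(h) for h in hrefs]}

One trade-off to be aware of: the stock pipeline names each saved file after a hash of its URL (under a full/ subdirectory) rather than keeping the original filename, so the manual save_pdf approach above is the simpler choice if you need human-readable names.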
answered Sep 19 '22 by starrify