I've been tasked with pulling PDF files from websites using Scrapy. I'm not new to Python, but Scrapy is very new to me. I've been experimenting with the console and a few rudimentary spiders. I've found and modified this code:
import urlparse
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)
I run this code at the command line with
scrapy crawl mySpider
and I get nothing back. I didn't create a Scrapy Item because I only want to crawl and download the files, not collect any metadata. I would appreciate any help with this.
The spider logic seems incorrect.
I had a quick look at your website, and it seems there are several types of pages:
#1 The index page you start from (research-and-insights.html), which links to individual article pages.
#2 The article pages, each of which contains a download link.
#3 The PDF files themselves, which are only linked from the #2 pages.
Thus the correct logic looks like: get the #1 page first, follow its links to the #2 pages, and from those download the #3 files.
However, your spider tries to extract links to the #3 files directly from the #1 page, so it finds nothing to download.
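If you want to verify that structure yourself, Scrapy's interactive shell lets you test selectors against the live page before touching the spider. A minimal check (the CSS selector is the same one used in the fixed spider below, so treat it as an assumption about the page markup):

scrapy shell "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"

# inside the shell, list the links to the article (#2) pages:
response.css('div#all_results h3 a::attr(href)').extract()[:5]

# the original spider's check: links ending in '.pdf'
# (per the diagnosis above, this comes back empty on the index page)
response.xpath('//a[@href]/@href').re(r'.*\.pdf$')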
EDITED:
I have updated your code, and here's something that actually works:
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        # Follow the links from the index page to the individual article pages.
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # On each article page, follow the links that point at .pdf files.
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Save the PDF body to a file named after the last URL segment.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
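Two notes on this. First, open(path, 'wb') writes relative to whatever directory you launch scrapy crawl from, so the PDFs land in your current working directory. Second, if you'd rather not handle the file writing yourself, Scrapy ships with a built-in FilesPipeline that downloads and stores files for you; here is a rough sketch of that alternative (it reuses the same selector assumption as above, and by default the pipeline stores files under SHA1-hashed names inside FILES_STORE):

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloaded_pdfs'   # directory where the pipeline saves the files

# in the spider, yield items carrying the PDF URLs instead of Requests:
def parse_article(self, response):
    pdf_links = response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract()
    # FilesPipeline downloads every URL listed in 'file_urls'
    # and records the download results under the 'files' key.
    yield {'file_urls': [response.urljoin(href) for href in pdf_links]}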