I need to save a file (.pdf) but I'm unsure how to do it. I need to save .pdfs and store them in such a way that they are organized in a directories much like they are stored on the site I'm scraping them off.
From what I can gather I need to make a pipeline, but from what I understand pipelines save "Items" and "items" are just basic data like strings/numbers. Is saving files a proper use of pipelines, or should I save file in spider instead?
Typical uses of item pipelines are: cleansing HTML data. validating scraped data (checking that the items contain certain fields) checking for duplicates (and dropping them)
Scrapy provides reusable item pipelines for downloading files attached to a particular item (for example, when you scrape products and also want to download their images locally).
Scrapy provides an item pipeline for downloading images attached to a particular item, for example, when you scrape products and also want to download their images locally.
Yes and no[1]. If you fetch a pdf it will be stored in memory, but if the pdfs are not big enough to fill up your available memory so it is ok.
You could save the pdf in the spider callback:
def parse_listing(self, response):
# ... extract pdf urls
for url in pdf_urls:
yield Request(url, callback=self.save_pdf)
def save_pdf(self, response):
path = self.get_path(response.url)
with open(path, "wb") as f:
f.write(response.body)
If you choose to do it in a pipeline:
# in the spider
def parse_pdf(self, response):
i = MyItem()
i['body'] = response.body
i['url'] = response.url
# you can add more metadata to the item
return i
# in your pipeline
def process_item(self, item, spider):
path = self.get_path(item['url'])
with open(path, "wb") as f:
f.write(item['body'])
# remove body and add path as reference
del item['body']
item['path'] = path
# let item be processed by other pipelines. ie. db store
return item
[1] another approach could be only store pdfs' urls and use another process to fetch the documents without buffering into memory. (e.g. wget
)
There is a FilesPipeline that you can use directly, assuming you already have the file url, the link shows how to use FilesPipeline:
https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With