Should I create a pipeline to save files with Scrapy?

Tags: python, scrapy, web-crawler, pipeline
I need to save a file (.pdf) but I'm unsure how to do it. I need to save the .pdfs and store them in directories organized much like they are on the site I'm scraping them from.

From what I can gather I need to make a pipeline, but as I understand it pipelines save "Items", and Items are just basic data like strings and numbers. Is saving files a proper use of pipelines, or should I save the file in the spider instead?

asked Aug 19 '11 14:08

John Lotacs

2 Answers

Yes and no[1]. When you fetch a pdf, the whole response body is held in memory, so this is fine as long as the pdfs are not large enough to exhaust your available memory.

You could save the pdf in the spider callback:

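from scrapy import Request

def parse_listing(self, response):
    # ... extract pdf urls into pdf_urls
    for url in pdf_urls:
        # schedule each pdf for download; save_pdf receives the response
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    # get_path is your own helper that maps a url to a local file path
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)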
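
The snippets in this answer call a get_path helper without defining it. Here is a minimal sketch of what it could look like if the goal is to mirror the site's directory layout locally. The names url_to_path and save_body and the "downloads" root are assumptions made for illustration, not part of Scrapy.

```python
import os
from urllib.parse import unquote, urlparse


def url_to_path(url, root="downloads"):
    """Map a URL to a local path under `root`, mirroring the site's directories."""
    parsed = urlparse(url)
    # Drop empty, "." and ".." segments so the result cannot escape `root`.
    parts = [p for p in unquote(parsed.path).split("/") if p not in ("", ".", "..")]
    if not parts:
        parts = ["index"]  # hypothetical fallback name for a bare domain URL
    return os.path.join(root, parsed.netloc, *parts)


def save_body(url, body, root="downloads"):
    """Write `body` (bytes) to the mirrored path, creating directories as needed."""
    path = url_to_path(url, root)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return path
```

With this sketch, a url such as http://example.com/docs/2011/report.pdf would land under downloads/example.com/docs/2011/ on a POSIX system. Query strings are ignored here, so two urls that differ only in their query would overwrite each other.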

If you choose to do it in a pipeline:

# in the spider
def parse_pdf(self, response):
    item = MyItem()
    # note: the whole pdf stays in memory until the pipeline writes it out
    item['body'] = response.body
    item['url'] = response.url
    # you can add more metadata to the item
    return item

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove the body and keep the path as a reference
    del item['body']
    item['path'] = path
    # let the item be processed by other pipelines, e.g. a db store
    return item

[1] Another approach is to store only the pdfs' urls and use a separate process (e.g. wget) to fetch the documents without buffering them in memory.
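
If you would rather stay in Python than shell out to wget, the footnote's idea can be sketched with the standard library alone: read the stored urls in a separate script and copy each response to disk in chunks, so a whole pdf is never held in memory at once. The function name stream_to_file and the chunk size are assumptions for illustration.

```python
import shutil
import urllib.request


def stream_to_file(url, dest, chunk_size=64 * 1024):
    """Copy the resource at `url` to `dest`, reading `chunk_size` bytes at a time."""
    with urllib.request.urlopen(url) as src, open(dest, "wb") as out:
        # copyfileobj loops over fixed-size reads instead of loading everything
        shutil.copyfileobj(src, out, chunk_size)
    return dest
```

This sketch has no retries, timeouts or politeness delays, which wget or Scrapy would otherwise handle for you.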

answered Oct 24 '22 19:10

R. Max

There is a FilesPipeline that you can use directly, assuming you already have the file url. This link shows how to use FilesPipeline:

https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
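
As a rough sketch of what that setup tends to look like in recent Scrapy releases: the pipeline is enabled in the project settings, and the spider yields items whose file_urls field lists the urls to download. The "downloads" directory, the "page" field and the pdf_item helper below are assumptions for illustration; check the import path against the Scrapy version you have installed.

```python
# settings.py (sketch): enable the built-in pipeline and choose a target directory
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloads"  # hypothetical directory; any writable path works


# in the spider: yield an item that lists the urls to fetch
def pdf_item(page_url, pdf_urls):
    # FilesPipeline reads "file_urls" and records its results under "files"
    return {"page": page_url, "file_urls": list(pdf_urls)}
```

By default FilesPipeline names stored files after a hash of the url rather than the site's own paths, so mirroring the site's directory layout would mean subclassing it and overriding its file_path method.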

answered Oct 24 '22 19:10

Deming
