
How to skip some file types while crawling with Scrapy?

Tags: mime, scrapy

I want to skip links to certain file types (.exe, .zip, .pdf) while crawling with Scrapy, but I don't want to use a Rule with a specific URL regular expression. How?

Update:

Since it's hard to decide whether to follow a link based on the response's Content-Type (which is only available once the download has started), I changed my approach to dropping the URL in a downloader middleware instead. Thanks Peter and Leo.

asked Aug 27 '12 by fxp


2 Answers

If you go to linkextractor.py within the Scrapy root directory, you will see the following:

"""
Common code and definitions used by Link extractors (located in
scrapy.contrib.linkextractor).
"""

# common file extensions that are not followed if they occur in links
IGNORED_EXTENSIONS = [
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',

    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',

    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
    'm4a',

    # other
    'css', 'pdf', 'doc', 'exe', 'bin', 'rss', 'zip', 'rar',
]

However, this applies to the link extractor (and you said you don't want to use Rules), so I am not sure it will solve your problem directly. (I initially read the question as asking how to change the file-extension restrictions without specifying them in each rule.)
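That said, the link extractors do accept a deny_extensions argument, so the default list can be overridden without touching Scrapy's source. A minimal sketch (import paths have varied across Scrapy versions; this assumes the contrib-era layout quoted above):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.linkextractor import IGNORED_EXTENSIONS

# extend the default ignore list with a couple of extra extensions
extractor = SgmlLinkExtractor(
    deny_extensions=IGNORED_EXTENSIONS + ['7z', 'torrent'])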

The good news is that you can also build your own downloader middleware and drop any requests for URLs that have an undesirable extension. See Downloader Middleware.

You can get the requested URL from the request object's url attribute: request.url

Basically, check the end of the URL string for '.exe' or whatever extension you want to drop, and if it ends in one of those extensions, raise an IgnoreRequest exception; the request will immediately be dropped.
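Here is a minimal sketch of such a middleware (the class name and the extension list are my own illustrations, not part of Scrapy):

from scrapy.exceptions import IgnoreRequest

class FilterExtensionsMiddleware(object):
    """Drop requests whose URL ends with an unwanted extension."""

    DENIED_EXTENSIONS = ('.exe', '.zip', '.pdf')

    def process_request(self, request, spider):
        # str.endswith accepts a tuple, so one call covers all extensions
        if request.url.lower().endswith(self.DENIED_EXTENSIONS):
            raise IgnoreRequest('denied extension: %s' % request.url)
        return None  # None tells Scrapy to continue handling the request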

UPDATE

To process the request before it is downloaded, you need to define the process_request method within your custom downloader middleware.

According to the Scrapy documentation:

process_request

This method is called for each request that goes through the download middleware.

process_request() should either return None, a Response object, or a Request object, or raise an IgnoreRequest exception.

If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request performed (and its response downloaded).

If it returns a Response object, Scrapy won’t bother calling ANY other request or exception middleware, or the appropriate download function; it’ll return that Response. Response middleware is always called on every Response.

If it returns a Request object, the returned request will be rescheduled (in the Scheduler) to be downloaded in the future. The callback of the original request will always be called. If the new request has a callback it will be called with the response downloaded, and the output of that callback will then be passed to the original callback. If the new request doesn’t have a callback, the response downloaded will be just passed to the original request callback.

If it raises an IgnoreRequest exception, the entire request will be dropped completely and its callback never called.

So essentially, just create a downloader middleware class with a process_request method that takes a request object and a spider object as parameters, and raise an IgnoreRequest exception if the URL contains an unwanted extension.
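A hypothetical settings.py entry to enable such a middleware (the module path and the priority number 543 are placeholders):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.FilterExtensionsMiddleware': 543,
}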

This all occurs before the page is downloaded. However, if you want to inspect the response headers instead, then a request will have to be made to the webpage.

You could always implement both a process_request and a process_response method in the middleware, the idea being that obvious extensions are dropped immediately, and, if for some reason a URL contains no file extension, the request is downloaded and then caught in the process_response method (where you can verify the Content-Type header).
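A combined sketch along those lines (again, the class name, extensions, and content types are illustrative assumptions):

from scrapy.exceptions import IgnoreRequest

class FilterExtensionsMiddleware(object):
    DENIED_EXTENSIONS = ('.exe', '.zip', '.pdf')
    DENIED_TYPES = ('application/zip', 'application/pdf',
                    'application/x-msdownload')

    def process_request(self, request, spider):
        # Cheap check: drop requests with obvious extensions before
        # anything is downloaded.
        if request.url.lower().endswith(self.DENIED_EXTENSIONS):
            raise IgnoreRequest('denied extension: %s' % request.url)
        return None

    def process_response(self, request, response, spider):
        # Fallback for extensionless URLs: the body has already been
        # downloaded at this point, but we can still keep it away
        # from the spider.
        ctype = response.headers.get('Content-Type', b'').decode('latin-1')
        if ctype.split(';')[0].strip() in self.DENIED_TYPES:
            raise IgnoreRequest('denied content type: %s' % ctype)
        return response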

answered Nov 23 '22 by Peter Kirby


.zip and .pdf are ignored by Scrapy by default (they are in the IGNORED_EXTENSIONS list shown above).

As a general rule, you can either configure a rule to follow only URLs that match your regexp (\.htm in this case):

rules = (Rule(SgmlLinkExtractor(allow=(r'\.htm', )), callback='parse_page', follow=True), )

or exclude the ones that match a regexp:

rules = (Rule(SgmlLinkExtractor(allow=(r'.*', ), deny=(r'\.pdf', r'\.zip')), callback='parse_page', follow=True), )

Read the documentation for more information.

answered Nov 23 '22 by Leo