 

Using Scrapy to crawl a public FTP server

How can I make Scrapy crawl a FTP server that doesn't require a username and password? I've tried adding the url to the start urls, but Scrapy requires a username and password for FTP access. I've overridden start_requests() to provide a default one (the username 'anonymous' and a blank password works when I try it with Linux's ftp command), but I now get 550 responses from the server.

What's the right way to go about crawling FTP servers with Scrapy - ideally a way that would work with all FTP servers that don't require a username or password for access?

asked Jan 04 '15 by false_azure

1 Answer

It is not documented, but Scrapy has this functionality built in. There is an FTPDownloadHandler which handles FTP downloads using Twisted's FTPClient. You don't need to call it directly; it is enabled automatically whenever an ftp:// URL is requested.

In your spider, continue using the scrapy.http.Request class, but provide the FTP credentials in the meta dictionary under the ftp_user and ftp_password keys:

yield Request(url, meta={'ftp_user': 'user', 'ftp_password': 'password'})

ftp_user and ftp_password are required. There are also two optional keys you can provide:

  • ftp_passive (enabled by default) sets the FTP connection to passive mode
  • ftp_local_filename:
    • If not given, the file data will come in response.body, as with a normal Scrapy Response, which implies the entire file will be held in memory.
    • If given, the file data will be saved to a local file with the given name. This helps avoid memory issues when downloading very big files. In addition, for convenience, the local file name is also given in the response body.

The latter is useful when you need to download a file and save it locally without processing the response in the spider callback.
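To make the shape of that meta dictionary concrete, here is a small hypothetical helper (not part of Scrapy's API, just a sketch) that assembles the keys the FTP download handler reads, based on the rules above:

```python
# Hypothetical convenience helper (not part of Scrapy): builds the meta
# dictionary that Scrapy's FTP download handler reads its settings from.
def ftp_meta(user='anonymous', password='', passive=True, local_filename=None):
    meta = {'ftp_user': user, 'ftp_password': password, 'ftp_passive': passive}
    if local_filename is not None:
        # Stream the download to disk instead of buffering it in memory.
        meta['ftp_local_filename'] = local_filename
    return meta

# Usage inside a spider would then look like:
#   yield Request(url, meta=ftp_meta(local_filename='/tmp/big_file.iso'))
```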

As for anonymous usage, the credentials to provide depend on the FTP server itself. The user is "anonymous", and the password is usually your email address, any string, or blank.
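If you want to check programmatically which anonymous credentials a server accepts before wiring them into a spider, Python's stdlib ftplib works well for a quick probe. This is just a sketch; point it at the server you actually intend to crawl:

```python
# Quick anonymous-login probe using Python's stdlib ftplib.
from ftplib import FTP

def check_anonymous(host, password=''):
    """Try an anonymous login; return the server's welcome banner on success."""
    ftp = FTP(host, timeout=30)
    try:
        # ftplib's login() defaults to 'anonymous' / 'anonymous@' as well.
        ftp.login('anonymous', password)
        return ftp.getwelcome()
    finally:
        ftp.close()
```

A permission error raised by login() (a 530 reply, typically) tells you the server wants a different password, such as a full email address.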

For reference, a quote from the specification (RFC 1635, "How to Use Anonymous FTP"):

Anonymous FTP is a means by which archive sites allow general access to their archives of information. These sites create a special account called "anonymous". User "anonymous" has limited access rights to the archive host, as well as some operating restrictions. In fact, the only operations allowed are logging in using FTP, listing the contents of a limited set of directories, and retrieving files. Some sites limit the contents of a directory listing an anonymous user can see as well. Note that "anonymous" users are not usually allowed to transfer files TO the archive site, but can only retrieve files from such a site.

Traditionally, this special anonymous user account accepts any string as a password, although it is common to use either the password "guest" or one's electronic mail (e-mail) address. Some archive sites now explicitly ask for the user's e-mail address and will not allow login with the "guest" password. Providing an e-mail address is a courtesy that allows archive site operators to get some idea of who is using their services.

Trying it out on the console usually helps you see what password you should use; the welcome message often explicitly states the password requirements. A real-world example:

$ ftp [email protected]
Connected to icebox.stratus.com.
220 Stratus-FTP-server
331 Anonymous login ok, send your complete email address as your password.
Password: 

Here is a working example for Mozilla's public FTP server:

import scrapy
from scrapy.http import Request

class FtpSpider(scrapy.Spider):
    name = "mozilla"
    allowed_domains = ["ftp.mozilla.org"]

    handle_httpstatus_list = [404]

    def start_requests(self):
        yield Request('ftp://ftp.mozilla.org/pub/mozilla.org/firefox/releases/README',
                      meta={'ftp_user': 'anonymous', 'ftp_password': ''})

    def parse(self, response):
        print(response.body)

If you run the spider, you will see the contents of the README file on the console:

Older releases have known security vulnerabilities, which are disclosed at 

  https://www.mozilla.org/security/known-vulnerabilities/

Mozilla strongly recommends you do not use them, as you are at risk of your computer 
being compromised. 
...
answered Sep 27 '22 by alecxe