How can I make Scrapy crawl an FTP server that doesn't require a username and password? I've tried adding the URL to start_urls, but Scrapy requires a username and password for FTP access. I've overridden start_requests() to provide default credentials (the username 'anonymous' and a blank password work when I try them with Linux's ftp command), but I now get 550 responses from the server.
What's the right way to go about crawling FTP servers with Scrapy - ideally a way that would work with all FTP servers that don't require a username or password for access?
It is not documented, but Scrapy has this functionality built in. There is an FTPDownloadHandler which handles FTP downloads using Twisted's FTPClient. You don't need to call it directly; it kicks in automatically whenever an ftp:// URL is requested.
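No configuration is needed to enable it; for reference, Scrapy's default settings already map the ftp scheme to this handler. A shortened sketch of the relevant entry (the exact module path comes from scrapy/settings/default_settings.py and may vary between Scrapy versions):
DOWNLOAD_HANDLERS_BASE = {
    # ... file, http, https and s3 handlers ...
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}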
In your spider, keep using the scrapy.http.Request class, but provide the FTP credentials in the meta dictionary under the ftp_user and ftp_password keys:
yield Request(url, meta={'ftp_user': 'user', 'ftp_password': 'password'})
ftp_user and ftp_password are required. There are also two optional keys you can provide:
- ftp_passive (enabled by default) sets the FTP connection to passive mode
- ftp_local_filename saves the downloaded file locally under the given filename
The latter is useful when you need to download a file and save it locally without processing the response in the spider callback.
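For example, a minimal sketch of ftp_local_filename usage (the server, file, and local path here are made up for illustration):
yield Request(
    'ftp://ftp.example.com/pub/data.csv',  # hypothetical anonymous FTP server and file
    meta={
        'ftp_user': 'anonymous',
        'ftp_password': '',
        'ftp_local_filename': '/tmp/data.csv',  # the file is written here, not handled in the callback
    }
)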
As for anonymous usage, what credentials to provide depends on the FTP server itself. The user is "anonymous", and the password is usually your email address, any string, or blank.
FYI, a quote from the specification (RFC 1635):
Anonymous FTP is a means by which archive sites allow general access to their archives of information. These sites create a special account called "anonymous". User "anonymous" has limited access rights to the archive host, as well as some operating restrictions. In fact, the only operations allowed are logging in using FTP, listing the contents of a limited set of directories, and retrieving files. Some sites limit the contents of a directory listing an anonymous user can see as well. Note that "anonymous" users are not usually allowed to transfer files TO the archive site, but can only retrieve files from such a site.
Traditionally, this special anonymous user account accepts any string as a password, although it is common to use either the password "guest" or one's electronic mail (e-mail) address. Some archive sites now explicitly ask for the user's e-mail address and will not allow login with the "guest" password. Providing an e-mail address is a courtesy that allows archive site operators to get some idea of who is using their services.
Trying it out in the console usually helps to see what password you should use; the welcome message often explicitly notes the password requirements. A real-world example:
$ ftp anonymous@ftp.stratus.com
Connected to icebox.stratus.com.
220 Stratus-FTP-server
331 Anonymous login ok, send your complete email address as your password.
Password:
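If you prefer to script that check, Python's standard ftplib can do the same probe; a quick sketch, assuming ftp.mozilla.org (the server from the example below) still serves anonymous FTP:
from ftplib import FTP

ftp = FTP('ftp.mozilla.org')       # connect on port 21
print(ftp.getwelcome())            # the banner often spells out the password requirements
print(ftp.login('anonymous', ''))  # e.g. '230 ...' on success
ftp.quit()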
Here is a working example for Mozilla's public FTP:
import scrapy
from scrapy.http import Request


class FtpSpider(scrapy.Spider):
    name = "mozilla"
    allowed_domains = ["ftp.mozilla.org"]
    handle_httpstatus_list = [404]  # let 404 responses reach the callback instead of being dropped

    def start_requests(self):
        # anonymous FTP: user "anonymous", blank password
        yield Request('ftp://ftp.mozilla.org/pub/mozilla.org/firefox/releases/README',
                      meta={'ftp_user': 'anonymous', 'ftp_password': ''})

    def parse(self, response):
        print(response.body)
If you run the spider, you should see the contents of the README file on the console:
Older releases have known security vulnerablities, which are disclosed at
https://www.mozilla.org/security/known-vulnerabilities/
Mozilla strongly recommends you do not use them, as you are at risk of your computer
being compromised.
...
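To try it yourself, save the spider to a file and run it with scrapy runspider, which works without a full Scrapy project (the filename is arbitrary):
$ scrapy runspider mozilla_ftp.py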