I am trying to read a file from an FTP server. The file is a .gz
file. I would like to know if I can perform actions on this file while the socket is open. I tried to follow what was mentioned in two StackOverflow questions on reading files without writing to disk and reading files from FTP without downloading but was not successful.
I know how to extract data/work on the downloaded file but I'm not sure if I can do it on the fly. Is there a way to connect to the site, get data in a buffer, possibly do some data extraction and exit?
When trying StringIO I got the error:
>>> from ftplib import FTP
>>> from StringIO import StringIO
>>> ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
File "C:\Python27\lib\ftplib.py", line 117, in __init__
self.connect(host)
File "C:\Python27\lib\ftplib.py", line 132, in connect
self.sock = socket.create_connection((self.host, self.port), self.timeout)
File "C:\Python27\lib\socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed
I just need to know how can I get data into some variable and loop on it until the file from FTP is read.
I appreciate your time and help. Thanks!
Downloading files We're using RETR command, which downloads a copy of a file on the server, we provide the file name we want to download as the first argument to the command, and the server will send a copy of the file to us. The ftp.
The ftp command uses the File Transfer Protocol (FTP) to transfer files between the local host and a remote host or between two remote hosts. Remote execution of the ftp command is not recommended. The FTP protocol allows data transfer between hosts that use dissimilar file systems.
Make sure to login to the ftp server first. After this, use retrbinary
which pulls the file in binary mode. It uses a callback on each chunk of the file. You can use this to load it into a string.
from ftplib import FTP
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@
# Setup a cheap way to catch the data (could use StringIO too)
data = []
def handle_binary(more_data):
data.append(more_data)
resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
data = "".join(data)
Bonus points: how about we decompress the string while we're at it?
Easy mode, using data string above
import gzip
import StringIO
zippy = gzip.GzipFile(fileobj=StringIO.StringIO(data))
uncompressed_data = zippy.read()
Little bit better, full solution:
from ftplib import FTP
import gzip
import StringIO
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@
sio = StringIO.StringIO()
def handle_binary(more_data):
sio.write(more_data)
resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
sio.seek(0) # Go back to the start
zippy = gzip.GzipFile(fileobj=sio)
uncompressed = zippy.read()
In reality, it would be much better to decompress on the fly but I don't see a way to do that with the built in libraries (at least not easily).
There are two easy ways I can think of to download a file using FTP and store it locally:
Using ftplib
:
from ftplib import FTP
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login()
ftp.cwd('pub/pmc')
ftp.retrbinary('RETR PMC-ids.csv.gz', open('PMC-ids.csv.gz', 'wb').write)
ftp.quit()
Using urllib
from urllib import urlretrieve
urlretrieve("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz", "PMC-ids.csv.gz")
If you don't want to download and store it to a file, but you want to process it gradually as it comes, I suggest using urllib2
:
from urllib2 import urlopen
u = urlopen("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/readme.txt")
for line in u:
print line
which prints your file line by line.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With