
Using pandas read_csv with zip compression

Tags: python, pandas

I'm trying to use read_csv in pandas to read a zipped file from an FTP server. The zip file contains just one file, as is required.

Here's my code:

pd.read_csv('ftp://ftp.fec.gov/FEC/2016/cn16.zip', compression='zip')

I get this error:

AttributeError: addinfourl instance has no attribute 'seek'

I get this error in both pandas 0.18.1 and 0.19.0. Am I missing something, or could this be a bug?

itzy asked Nov 22 '16 14:11


3 Answers

pandas now supports loading data straight from zip or other compressed files into a DataFrame. From the read_csv documentation:

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’

For on-the-fly decompression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression.

New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.

import pandas as pd

df = pd.read_csv("path_to_file.zip")
# or
df = pd.read_csv("path_to_file.zip", compression="zip")
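To try this locally without any download, here is a minimal sketch that writes a one-member zip to a temporary directory and reads it back both ways. The file names `sample.zip` and `data.csv` and the sample rows are made up for illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a throwaway zip archive containing exactly one CSV file.
# The file names and the data are illustrative only.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "sample.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("data.csv", "id,name\n1,alice\n2,bob\n")

# compression defaults to 'infer', so the '.zip' extension is enough
df_inferred = pd.read_csv(zip_path)

# Passing compression='zip' explicitly should give the same result
df_explicit = pd.read_csv(zip_path, compression="zip")
```

Both calls read the single member of the archive; if the archive held more than one data file, read_csv with zip compression would refuse it, as the documentation excerpt says.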
Vlad Bezden answered Sep 21 '22 19:09


Although I'm not completely sure why you get the error, you can work around it by opening the URL with urllib2 (urllib.request on Python 3) and reading the data into an in-memory binary stream. You also have to specify the correct separator, or you will get another error.

import io
import pandas as pd
try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

# Read the whole archive into memory so that pandas gets a seekable buffer
r = urlopen('ftp://ftp.fec.gov/FEC/2016/cn16.zip')
df = pd.read_csv(io.BytesIO(r.read()), compression='zip', sep='|', header=None)

As for the error itself, I think pandas calls seek on the URL response object before its contents have been downloaded (so it is not yet a real zip file), and that object has no seek method, which would produce the error.
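One way to see why seek matters: a ZIP archive keeps its directory of members at the end of the file, so a reader has to jump there before it can extract anything. The sketch below is not pandas' internal code; it uses only the standard library zipfile module, and `ReadOnlyStream` is a hypothetical stand-in for a network response that offers read() but no seek().

```python
import io
import zipfile

# Build a small zip archive in memory (contents are illustrative).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "a|b\n1|2\n")
zip_bytes = buf.getvalue()


class ReadOnlyStream(object):
    """Hypothetical stand-in for a network response: read() but no seek()."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def read(self, size=-1):
        return self._inner.read(size)


# A stream without seek() cannot be opened as a zip archive.
stream_error = None
try:
    zipfile.ZipFile(ReadOnlyStream(zip_bytes))
except (AttributeError, zipfile.BadZipFile) as exc:
    stream_error = exc

# Reading everything into BytesIO first yields a seekable buffer.
with zipfile.ZipFile(io.BytesIO(ReadOnlyStream(zip_bytes).read())) as zf:
    member_names = zf.namelist()
```

This is the same idea as the workaround above: read the whole response into io.BytesIO first, then hand that buffer to the zip reader.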

PyNoob answered Sep 22 '22 19:09


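import io
import zipfile
import pandas as pd
import requests
url = 'https://example.com/data.zip'  # hypothetical placeholder; use your archive's HTTP(S) URL
header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/54.0.1'}
remotezip = requests.get(url, headers=header)
root = zipfile.ZipFile(io.BytesIO(remotezip.content))
for name in root.namelist():
    df = pd.read_csv(root.open(name))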

Taken from my own blog post: Read zipped csv files in python pandas without downloading zipfile
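Note that the loop above assigns to the same df on every pass, so with an archive of several files only the last one survives. A minimal sketch of one way to keep them all, assuming every member is a CSV; the in-memory archive stands in for remotezip.content, and the member names and data are made up for illustration:

```python
import io
import zipfile

import pandas as pd

# Stand-in for remotezip.content: an in-memory archive with two CSV members.
# Member names and data are made up for illustration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("first.csv", "x,y\n1,2\n")
    zf.writestr("second.csv", "x,y\n3,4\n5,6\n")
content = buf.getvalue()

# Keep one DataFrame per member instead of overwriting a single variable.
frames = {}
with zipfile.ZipFile(io.BytesIO(content)) as root:
    for name in root.namelist():
        with root.open(name) as member:
            frames[name] = pd.read_csv(member)
```

Each DataFrame is then available under its member name, for example frames["second.csv"].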

Vinod answered Sep 18 '22 19:09