If you want to read a zipped or a tar.gz file into pandas dataframe, the read_csv
methods includes this particular implementation.
df = pd.read_csv('filename.zip')
Or the long form:
df = pd.read_csv('filename.zip', compression='zip', header=0, sep=',', quotechar='"')
Description of the compression argument from the docs:
compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’ For on-the-fly decompression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression.
New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.
I think you want to open
the ZipFile, which returns a file-like object, rather than read
:
In [11]: crime2013 = pd.read_csv(z.open('crime_incidents_2013_CSV.csv'))
In [12]: crime2013
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24567 entries, 0 to 24566
Data columns (total 15 columns):
CCN 24567 non-null values
REPORTDATETIME 24567 non-null values
SHIFT 24567 non-null values
OFFENSE 24567 non-null values
METHOD 24567 non-null values
LASTMODIFIEDDATE 24567 non-null values
BLOCKSITEADDRESS 24567 non-null values
BLOCKXCOORD 24567 non-null values
BLOCKYCOORD 24567 non-null values
WARD 24563 non-null values
ANC 24567 non-null values
DISTRICT 24567 non-null values
PSA 24567 non-null values
NEIGHBORHOODCLUSTER 24263 non-null values
BUSINESSIMPROVEMENTDISTRICT 3613 non-null values
dtypes: float64(4), int64(1), object(10)
It seems you don't even have to specify the compression any more. The following snippet loads the data from filename.zip into df.
import pandas as pd
df = pd.read_csv('filename.zip')
(Of course you will need to specify separator, header, etc. if they are different from the defaults.)
For "zip" files, you can use import zipfile
and your code will be working simply with these lines:
import zipfile
import pandas as pd
with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
with z.open("Crime_Incidents_in_2013.csv") as f:
train = pd.read_csv(f, header=0, delimiter="\t")
print(train.head()) # print the first 5 rows
And the result will be:
X,Y,CCN,REPORT_DAT,SHIFT,METHOD,OFFENSE,BLOCK,XBLOCK,YBLOCK,WARD,ANC,DISTRICT,PSA,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,XCOORD,YCOORD,LATITUDE,LONGITUDE,BID,START_DATE,END_DATE,OBJECTID
0 -77.054968548763071,38.899775938598317,0925135...
1 -76.967309569035052,38.872119553647011,1003352...
2 -76.996184958456539,38.927921847721443,1101010...
3 -76.943077541353617,38.883686046653935,1104551...
4 -76.939209158039446,38.892278093281632,1125028...
I guess what your looking is the following
from io import BytesIO
import requests
import pandas as pd
result = requests.get("https://www.xxx.zzz/file.zip")
df = pd.read_csv(BytesIO(result.content),compression='zip', header=0, sep=',', quotechar='"')
Read these article to understand why: https://medium.com/dev-bits/ultimate-guide-for-working-with-i-o-streams-and-zip-archives-in-python-3-6f3cf96dca50
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With