This question is about improving pandas' performance for stacking and unstacking operations.
The issue is that I have a large dataframe (~2 GB). I followed this blog to compress it down to ~150 MB successfully. However, my stacking and unstacking operations take so long that I have to kill the kernel and restart everything. I have also used R's data.table package, and it just flies, completing the operation in under a second.
I researched this on SO. Some people have pointed to map-reduce in the "Dataframe unstack performance - pandas" thread, but I am not sure about that for two reasons:
1. stack and unstack on the uncompressed data run fine in pandas; it's my original dataset that I can't process that way because of memory problems.
2. data.table easily (<1 second) converts from long to wide format.
I managed to cut a small feed (5 MB) for representation purposes for SO. The feed has been uploaded to http://www.filedropper.com/ddataredact. This file should be able to reproduce the problem.
Here's my pandas code:
import pandas as pd
import numpy as np
import random
import string

# Added code to generate test data
data = {'ANDroid_Margin': {'type': 'float', 'len': 13347},
        'Name':           {'type': 'cat',   'len': 71869},
        'Geo1':           {'type': 'cat',   'len': 4},
        'Geo2':           {'type': 'cat',   'len': 31},
        'Model':          {'type': 'cat',   'len': 2}}

ddata_i = pd.DataFrame()
len_data = 114348

for colk, colv in data.items():
    print("Processing column:", colk)
    # Categorical columns get random 5-character strings,
    # numeric columns get random floats
    if data[colk]['type'] == 'cat':
        chars = string.digits + string.ascii_lowercase
        replacement_value = [
            "".join([random.choice(chars) for i in range(5)])
            for j in range(data[colk]['len'])]
    else:
        replacement_value = np.random.uniform(
            low=0.0, high=20.0, size=(data[colk]['len'],))
    ddata_i[colk] = np.random.choice(
        replacement_value, size=len_data, replace=True)

# Unstack and stack now. This will show the result quickly
ddata_i.groupby(["Name", "Geo1", "Geo2", "Model"]).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()
# Compress our data
ddata = ddata_i.copy()

df_obj = ddata.select_dtypes(include=['object']).copy()
for col in df_obj:
    df_obj.loc[:, col] = df_obj[col].astype('category')
ddata[df_obj.columns] = df_obj

df_obj = ddata.select_dtypes(include=['float']).copy()
for col in df_obj:
    df_obj.loc[:, col] = df_obj[col].astype('float')
ddata[df_obj.columns] = df_obj

# Let's quickly check whether the compressed frame is the same as the original
assert ddata.shape == ddata_i.shape, "Output seems wrong"
assert ddata_i.ANDroid_Margin.sum() == ddata.ANDroid_Margin.sum(), "Sum isn't right"
for col in ["ANDroid_Margin", "Name", "Geo1", "Geo2"]:
    assert sorted(list(ddata_i[col].unique())) == sorted(list(ddata[col].unique()))

# This will run forever
ddata.groupby(["Name", "Geo1", "Geo2", "Model"]).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()
You will note that the stacking and unstacking operations on ddata_i run quickly, but not on the compressed ddata. Why is this?
Also, I noticed that if I convert only the object columns or only the float columns, then stack() and unstack() run quickly. It's only when I do both that the problem appears.
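For scale, here is a small diagnostic sketch (using the ddata_i and ddata frames built above) that compares the number of key combinations actually present in the data with the product of the per-column category counts in the compressed frame:

import numpy as np

keys = ["Name", "Geo1", "Geo2", "Model"]

# Key combinations that actually occur in the data (object-dtype keys)
observed_combos = ddata_i.groupby(keys).ngroups

# Full cartesian product of the category levels in the compressed frame
full_product = np.prod([len(ddata[c].cat.categories) for c in keys])

print(observed_combos, full_product)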
Can someone please help me understand what I am missing? How can I fix this problem in pandas? With such big performance issues, how can I write production-ready code in pandas? I'd appreciate your thoughts.
Finally, here's the R data.table code. I have to say that data.table is not only fast, but with it I don't have to go through compressing and decompressing.
df <- data.table::fread("ddata_redact.csv",
                        stringsAsFactors = FALSE,
                        data.table = TRUE,
                        header = TRUE)

df1 <- data.table::dcast(df, Name + Geo1 + Geo2 ~ Model,
                         value.var = "ANDroid_Margin",
                         fun.aggregate = sum)
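For comparison, a rough pandas analog of that dcast call might look like the following (a sketch assuming the same column names as above; data.table's handling of empty groups may differ slightly):

import pandas as pd

df = pd.read_csv("ddata_redact.csv")

# One column per Model value, summing ANDroid_Margin within each
# Name/Geo1/Geo2 group -- roughly what dcast(..., fun.aggregate = sum) does
df1 = pd.pivot_table(df,
                     index=["Name", "Geo1", "Geo2"],
                     columns="Model",
                     values="ANDroid_Margin",
                     aggfunc="sum",
                     fill_value=0).reset_index()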
Python's sys info:
sys.version_info
> sys.version_info(major=3, minor=6, micro=7, releaselevel='final', serial=0)
pandas version:
pd.__version__
> '0.23.4'
data.table version: 1.11.8
I figured out the answer. The issue is that we need to pass observed=True to groupby to prevent pandas from computing the cartesian product of the category levels.
After compression, I had to run this:
ddata.groupby(["Name", "Geo1", "Geo2", "Model"], observed=True).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()
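A minimal sketch of the effect on a toy frame (the column names here are made up for illustration): with categorical keys, the default observed=False materializes every combination of category levels, while observed=True keeps only the combinations that actually occur in the data.

import pandas as pd

toy = pd.DataFrame({
    "a": pd.Categorical(["x", "x", "y"], categories=["x", "y", "z"]),
    "b": pd.Categorical(["p", "q", "p"], categories=["p", "q"]),
    "v": [1, 2, 3],
})

# Default in pandas 0.23 (observed=False): all 3 x 2 category combinations
print(toy.groupby(["a", "b"]).sum().shape[0])                 # 6

# observed=True: only the 3 combinations present in the data
print(toy.groupby(["a", "b"], observed=True).sum().shape[0])  # 3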