pandas stack and unstack performance degrades after dataframe compression and is much worse than R's data.table

This question is about improving pandas' performance for stack and unstack operations.

The issue is that I have a large dataframe (~2GB). I followed this blog to compress it to ~150MB successfully. However, after compression my stack and unstack operations run indefinitely, to the point where I have to kill the kernel and restart everything.

I have also tried R's data.table package, and it just flies, completing the same operation in under a second.

I researched this on SO. Some people have pointed to map-reduce in the "Dataframe unstack performance - pandas" thread, but I am not convinced it applies here, for two reasons:

  1. stack and unstack on the uncompressed dataframe run fine in pandas, but I can't skip compression on my original dataset because of memory problems.
  2. R's data.table easily (<1 second) converts from long to wide format.

I managed to cut a small feed (5MB) for demonstration purposes. The feed has been uploaded to http://www.filedropper.com/ddataredact; this file should reproduce the problem.

Here's my pandas code:

import random
import string

import numpy as np
import pandas as pd

#Added code to generate test data
data = {'ANDroid_Margin':{'type':'float','len':13347},
        'Name':{'type':'cat','len':71869},
        'Geo1':{'type':'cat','len':4},
        'Geo2':{'type':'cat','len':31},
        'Model':{'type':'cat','len':2}}

ddata_i = pd.DataFrame()
len_data = 114348
#Generate each column: random strings for categoricals, random floats otherwise
for colk,colv in data.items():
    print("Processing column:",colk)
    #Is the column categorical?
    if data[colk]['type']=='cat':
        chars = string.digits + string.ascii_lowercase
        replacement_value = [
            "".join(
                [random.choice(chars) for i in range(5)]
            ) for j in range(data[colk]['len'])]

    else:
        replacement_value = np.random.uniform(
            low=0.0, high=20.0, size=(data[colk]['len'],))
    ddata_i[colk] = np.random.choice(
        replacement_value,size=len_data,replace = True)

#Unstack and Stack now. This will show the result quickly
ddata_i.groupby(["Name","Geo1","Geo2","Model"]).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()

#Compress our data
ddata = ddata_i.copy()

df_obj = ddata.select_dtypes(include=['object']).copy()
for col in df_obj:
    df_obj.loc[:, col] = df_obj[col].astype('category')
ddata[df_obj.columns] = df_obj

df_obj = ddata.select_dtypes(include=['float']).copy()
for col in df_obj:
    df_obj.loc[:, col] = df_obj[col].astype('float')
ddata[df_obj.columns] = df_obj
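
To confirm the compression actually took effect, here's a quick sanity check (an extra step I'm adding here; memory_usage(deep=True) reports actual per-column bytes):

#Sanity check: confirm the category conversion reduced memory
print("before:", round(ddata_i.memory_usage(deep=True).sum()/1e6, 1), "MB")
print("after: ", round(ddata.memory_usage(deep=True).sum()/1e6, 1), "MB")
print(ddata.dtypes)  #object columns should now show as 'category'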

#Let's also check that the compressed dataframe matches the original
assert ddata.shape==ddata_i.shape, "Output seems wrong"
assert ddata_i.ANDroid_Margin.sum()==ddata.ANDroid_Margin.sum(),"Sum isn't right"
for col in ["ANDroid_Margin","Name","Geo1","Geo2"]:
    assert sorted(list(ddata_i[col].unique()))==sorted(list(ddata[col].unique()))

#This will run forever
ddata.groupby(["Name","Geo1","Geo2","Model"]).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()

You will note that the stack and unstack operations on ddata_i run quickly, but not on the compressed ddata. Why is this?
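
To put numbers on "quickly" versus "forever", both runs can be wrapped in a simple timer (a sketch of mine; time_pipeline is not part of my original script):

import time

def time_pipeline(df, label):
    #Run the groupby/unstack/stack pipeline once and report wall time
    start = time.perf_counter()
    out = df.groupby(["Name","Geo1","Geo2","Model"]).\
        sum().\
        unstack().\
        stack(dropna=False).\
        reset_index()
    print(label, round(time.perf_counter()-start, 2), "seconds")
    return out

time_pipeline(ddata_i, "uncompressed:")
#time_pipeline(ddata, "compressed:")  #careful: this is the call that hangs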

Also, I noticed that if I compress either the object columns or the float columns, stack() and unstack() still run quickly. It's only when I do both that the problem appears.



Finally, here's the R data.table code. I have to say that data.table is not only fast; it also spares me the whole compression step.

df <- data.table::fread("ddata_redact.csv",
                        stringsAsFactors=FALSE,
                        data.table = TRUE, 
                        header = TRUE)

df1=data.table::dcast(df, Name + Geo1 + Geo2 ~ Model, 
                      value.var = "ANDroid_Margin",
                      fun.aggregate = sum)
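
For reference, the closest pandas equivalent of the dcast call above would be something like this (a rough equivalent I'm sketching here, not code from my original script):

import pandas as pd

#Pivot long -> wide: one column per Model, summing ANDroid_Margin
#within each Name/Geo1/Geo2 group (mirrors the dcast call)
ddata = pd.read_csv("ddata_redact.csv")
wide = ddata.pivot_table(index=["Name","Geo1","Geo2"],
                         columns="Model",
                         values="ANDroid_Margin",
                         aggfunc="sum").reset_index()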

Can someone please help me understand what I am missing? How can I fix this problem in pandas? With performance issues this big, I'm not sure how to write production-ready pandas code. I'd appreciate your thoughts.


Python's sys info:

sys.version_info
> sys.version_info(major=3, minor=6, micro=7, releaselevel='final', serial=0)

pandas version:

pd.__version__
> '0.23.4'

data.table version:

1.11.8
asked Dec 23 '18 by watchtower



1 Answer

I figured out the answer. The issue is that we need to pass observed=True to groupby to prevent pandas from computing the Cartesian product of all category combinations.

After compression, I had to run this...

ddata.groupby(["Name","Geo1","Geo2","Model",observed = True]).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()
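
To see why this matters: when every grouping key is categorical, groupby in this pandas version defaults to observed=False and builds the result index from every possible combination of categories, whether or not it occurs in the data. A minimal toy illustration (my sketch, not the original data):

import numpy as np
import pandas as pd

n = 1000
demo = pd.DataFrame({
    "a": pd.Categorical(np.arange(n)),  #1000 categories
    "b": pd.Categorical(np.arange(n)),  #1000 categories, paired 1:1 with a
    "x": np.random.rand(n)})

#Default (observed=False): one row per category combination -> 1,000,000 rows
print(len(demo.groupby(["a","b"]).sum()))
#observed=True keeps only the 1,000 combinations that actually occur
print(len(demo.groupby(["a","b"], observed=True).sum()))

With the four categorical keys in my data, that product is up to 71869 x 4 x 31 x 2, roughly 17.8 million index combinations, which is why the compressed run never finishes.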
answered by watchtower