pandas stack and unstack performance degrades after dataframe compression and is much worse than R's data.table

This question is about improving pandas' performance for stack and unstack operations.

The issue is that I have a large dataframe (~2GB). I followed this blog to compress it to ~150MB successfully. However, after compression my stack and unstack operations run indefinitely, to the point where I have to kill the kernel and restart everything.

I have also tried R's data.table package, and it just flies, completing the same operation in under a second.

I researched this on SO. Some people have pointed to map-reduce in the "Dataframe unstack performance - pandas" thread, but I am not convinced it applies here, for two reasons:

  1. stack and unstack on the uncompressed dataframe run fine in pandas, but I can't skip compression on my original dataset because of memory problems.
  2. R's data.table easily (<1 second) converts from long to wide format.

I managed to cut a small feed (5MB) for demonstration purposes. The feed has been uploaded to http://www.filedropper.com/ddataredact; this file should reproduce the problem.

Here's my pandas code:

import random
import string

import numpy as np
import pandas as pd

#Added code to generate test data
data = {'ANDroid_Margin':{'type':'float','len':13347},
        'Name':{'type':'cat','len':71869},
        'Geo1':{'type':'cat','len':4},
        'Geo2':{'type':'cat','len':31},
        'Model':{'type':'cat','len':2}}

ddata_i = pd.DataFrame()
len_data = 114348
#Generate each column: random strings for categoricals, random floats otherwise
for colk,colv in data.items():
    print("Processing column:",colk)
    #Is the column categorical?
    if data[colk]['type']=='cat':
        chars = string.digits + string.ascii_lowercase
        replacement_value = [
            "".join(
                [random.choice(chars) for i in range(5)]
            ) for j in range(data[colk]['len'])]

    else:
        replacement_value = np.random.uniform(
            low=0.0, high=20.0, size=(data[colk]['len'],))
    ddata_i[colk] = np.random.choice(
        replacement_value,size=len_data,replace = True)

#Unstack and Stack now. This will show the result quickly
ddata_i.groupby(["Name","Geo1","Geo2","Model"]).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()

#Compress our data
ddata = ddata_i.copy()

df_obj = ddata.select_dtypes(include=['object']).copy()
for col in df_obj:
    df_obj.loc[:, col] = df_obj[col].astype('category')
ddata[df_obj.columns] = df_obj

df_obj = ddata.select_dtypes(include=['float']).copy()
for col in df_obj:
    df_obj.loc[:, col] = df_obj[col].astype('float')
ddata[df_obj.columns] = df_obj
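
To confirm the compression actually took effect, here's a quick sanity check (an extra step I'm adding here; memory_usage(deep=True) reports actual per-column bytes):

#Sanity check: confirm the category conversion reduced memory
print("before:", round(ddata_i.memory_usage(deep=True).sum()/1e6, 1), "MB")
print("after: ", round(ddata.memory_usage(deep=True).sum()/1e6, 1), "MB")
print(ddata.dtypes)  #object columns should now show as 'category'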

#Let's also check that the compressed dataframe matches the original
assert ddata.shape==ddata_i.shape, "Output seems wrong"
assert ddata_i.ANDroid_Margin.sum()==ddata.ANDroid_Margin.sum(),"Sum isn't right"
for col in ["ANDroid_Margin","Name","Geo1","Geo2"]:
    assert sorted(list(ddata_i[col].unique()))==sorted(list(ddata[col].unique()))

#This will run forever
ddata.groupby(["Name","Geo1","Geo2","Model"]).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()

You will note that the stack and unstack operations on ddata_i run quickly, but not on the compressed ddata. Why is this?
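
To put numbers on "quickly" versus "forever", both runs can be wrapped in a simple timer (a sketch of mine; time_pipeline is not part of my original script):

import time

def time_pipeline(df, label):
    #Run the groupby/unstack/stack pipeline once and report wall time
    start = time.perf_counter()
    out = df.groupby(["Name","Geo1","Geo2","Model"]).\
        sum().\
        unstack().\
        stack(dropna=False).\
        reset_index()
    print(label, round(time.perf_counter()-start, 2), "seconds")
    return out

time_pipeline(ddata_i, "uncompressed:")
#time_pipeline(ddata, "compressed:")  #careful: this is the call that hangs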

Also, I noticed that if I compress either the object columns or the float columns, stack() and unstack() still run quickly. It's only when I do both that the problem appears.



Finally, here's the R data.table code. I have to say that data.table is not only fast; it also spares me the whole compression step.

df <- data.table::fread("ddata_redact.csv",
                        stringsAsFactors=FALSE,
                        data.table = TRUE, 
                        header = TRUE)

df1=data.table::dcast(df, Name + Geo1 + Geo2 ~ Model, 
                      value.var = "ANDroid_Margin",
                      fun.aggregate = sum)
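
For reference, the closest pandas equivalent of the dcast call above would be something like this (a rough equivalent I'm sketching here, not code from my original script):

import pandas as pd

#Pivot long -> wide: one column per Model, summing ANDroid_Margin
#within each Name/Geo1/Geo2 group (mirrors the dcast call)
ddata = pd.read_csv("ddata_redact.csv")
wide = ddata.pivot_table(index=["Name","Geo1","Geo2"],
                         columns="Model",
                         values="ANDroid_Margin",
                         aggfunc="sum").reset_index()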

Can someone please help me understand what I am missing? How can I fix this problem in pandas? With performance issues this big, I'm not sure how to write production-ready pandas code. I'd appreciate your thoughts.


Python's sys info:

sys.version_info
> sys.version_info(major=3, minor=6, micro=7, releaselevel='final', serial=0)

pandas version:

pd.__version__
> '0.23.4'

data.table version:

1.11.8
asked Dec 23 '18 by watchtower



1 Answer

I figured out the answer. The issue is that we need to pass observed=True to groupby to prevent pandas from computing the Cartesian product of all category combinations.

After compression, I had to run this...

ddata.groupby(["Name","Geo1","Geo2","Model",observed = True]).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()
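
To see why this matters: when every grouping key is categorical, groupby in this pandas version defaults to observed=False and builds the result index from every possible combination of categories, whether or not it occurs in the data. A minimal toy illustration (my sketch, not the original data):

import numpy as np
import pandas as pd

n = 1000
demo = pd.DataFrame({
    "a": pd.Categorical(np.arange(n)),  #1000 categories
    "b": pd.Categorical(np.arange(n)),  #1000 categories, paired 1:1 with a
    "x": np.random.rand(n)})

#Default (observed=False): one row per category combination -> 1,000,000 rows
print(len(demo.groupby(["a","b"]).sum()))
#observed=True keeps only the 1,000 combinations that actually occur
print(len(demo.groupby(["a","b"], observed=True).sum()))

With the four categorical keys in my data, that product is up to 71869 x 4 x 31 x 2, roughly 17.8 million index combinations, which is why the compressed run never finishes.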
answered by watchtower