
Why does my function run faster after I split a pandas DataFrame into chunks, compared to simply using apply()?

Tags: python, pandas

I am trying to shuffle each column of a pandas DataFrame independently. Here are the functions I wrote:

import numpy as np
import pandas as pd


def shuffle_x(x):
    # Copy first so the shuffle does not mutate the array apply() hands in.
    x = x.copy()
    np.random.shuffle(x)
    return x


def shuffle_table(df):
    # raw=True passes each column to shuffle_x as a bare NumPy array.
    df_shuffled = df.apply(shuffle_x, raw=True, axis=0)
    return df_shuffled

Now I am testing this on a pandas DataFrame df with 30000 rows and 1000 columns. If I call shuffle_table(df) directly, it is really slow and takes more than 1500 seconds. However, if I do something like this:

df_split = np.split(df, 100, axis=1)  # 100 sub-frames of 10 columns each
df_shuffled = pd.concat([shuffle_table(x) for x in df_split], axis=1)

it is much faster and takes only 60 seconds.

My best guess is that this is an issue related to the way pandas allocates space when generating the new DataFrame.

Also, the fastest way I have come up with is:

tmp_d = {}
for col in df.columns:
    # df[col].values can be a view into df's underlying data, so copy it
    # explicitly; otherwise the shuffle would also reorder df in place.
    tmp_val = df[col].values.copy()
    np.random.shuffle(tmp_val)
    tmp_d[col] = tmp_val

df_shuffled = pd.DataFrame(tmp_d)
# Restore the original column order.
df_shuffled = df_shuffled[df.columns]

This takes approximately 15 seconds.
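For reference, here is a minimal, self-contained timing sketch (not from the original post; the frame size and chunk width are assumptions chosen so it finishes quickly) that compares the three approaches side by side. It uses iloc slices to mirror the np.split chunking:

import time

import numpy as np
import pandas as pd


def shuffle_x(x):
    # Same per-column shuffle as above: copy, then shuffle the copy.
    x = x.copy()
    np.random.shuffle(x)
    return x


df = pd.DataFrame(np.random.rand(30000, 100))  # smaller than the 1000-column case

# 1) Plain apply() over the whole frame.
start = time.perf_counter()
df.apply(shuffle_x, raw=True, axis=0)
print(f"plain apply: {time.perf_counter() - start:.2f}s")

# 2) Chunked apply(): slice into 10-column sub-frames, shuffle, reassemble.
start = time.perf_counter()
chunks = [df.iloc[:, i:i + 10] for i in range(0, df.shape[1], 10)]
pd.concat([c.apply(shuffle_x, raw=True, axis=0) for c in chunks], axis=1)
print(f"chunked:     {time.perf_counter() - start:.2f}s")

# 3) Dict-of-arrays approach.
start = time.perf_counter()
tmp_d = {}
for col in df.columns:
    tmp_val = df[col].values.copy()
    np.random.shuffle(tmp_val)
    tmp_d[col] = tmp_val
pd.DataFrame(tmp_d)[df.columns]
print(f"dict-based:  {time.perf_counter() - start:.2f}s")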

Asked by Eric He, Aug 06 '18



1 Answer

It's faster because it's not doing the same thing.

Fully shuffling a sequence, so that any permutation is possible, requires at least O(n) time. So the bigger your DataFrame, the longer it will take to shuffle.
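For intuition, a full shuffle along the lines of the classic Fisher-Yates algorithm (essentially what np.random.shuffle does) performs one swap per element, which is where the O(n) bound comes from. A tiny sketch:

import random


def fisher_yates(seq):
    # Walk from the end, swapping each position with a random earlier (or
    # equal) position: one swap per element, hence O(n) overall.
    for i in range(len(seq) - 1, 0, -1):
        j = random.randint(0, i)
        seq[i], seq[j] = seq[j], seq[i]
    return seq


print(fisher_yates(list(range(10))))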

Your second example is not equivalent, because it is not fully random: it only shuffles the individual chunks. If there is a column like [1, 2, 3, ..., 29999, 30000], the chunked method will never, for instance, produce a result like [1, 30000, 2, 29999, ...], because it never mixes the beginning of the sequence with the end. There are many possible shuffles that chunk-based shuffling can never reach.
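A toy illustration of that argument (the numbers here are made up, not the poster's data): split a sequence into chunks and shuffle each chunk independently, and no element ever crosses a chunk boundary:

import numpy as np

seq = np.arange(12)
chunks = np.split(seq, 3)      # views: [0..3], [4..7], [8..11]
for chunk in chunks:
    np.random.shuffle(chunk)   # each chunk is shuffled only within itself
print(np.concatenate(chunks))  # e.g. [ 2  0  3  1  6  4  7  5 11  9  8 10]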

In theory, if you split your DataFrame into 100 equal-sized chunks, you would expect each chunk to shuffle about 100 times faster than the whole table. Based on your timings, it looks like the sub-shuffles are actually taking longer than that would predict, which I would guess is at least partly due to the overhead of creating the sub-tables in the first place.
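One way to test that guess (a sketch, with the sizes from the question assumed) is to time the split-and-reassemble machinery on its own, with no shuffling at all; whatever this reports is pure chunking overhead that the one-shot apply() never pays:

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(30000, 1000))

# Slice into 100 ten-column sub-tables and concatenate them straight back.
start = time.perf_counter()
chunks = [df.iloc[:, i:i + 10] for i in range(0, df.shape[1], 10)]
pd.concat(chunks, axis=1)
print(f"split/concat overhead: {time.perf_counter() - start:.2f}s")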

Answered by BrenBarn