
Why does my function run faster after I split a pandas DataFrame into chunks, compared to simply using apply()?

Tags: python, pandas

I am trying to shuffle each column of a pandas DataFrame independently. Here are the functions I wrote:

import numpy as np
import pandas as pd


def shuffle_x(x):
    # Copy first so the shuffle does not mutate the array apply() hands in.
    x = x.copy()
    np.random.shuffle(x)
    return x


def shuffle_table(df):
    # raw=True passes each column to shuffle_x as a bare NumPy array.
    df_shuffled = df.apply(shuffle_x, raw=True, axis=0)
    return df_shuffled

Now I am testing this on a pandas DataFrame df with 30000 rows and 1000 columns. If I call shuffle_table(df) directly, it is really slow and takes more than 1500 seconds. However, if I do something like this:

df_split = np.split(df, 100, axis=1)  # 100 sub-frames of 10 columns each
df_shuffled = pd.concat([shuffle_table(x) for x in df_split], axis=1)

it is much faster and takes only 60 seconds.

My best guess is that this is an issue related to the way pandas allocates space when generating the new DataFrame.

Also, the fastest way I have come up with is:

tmp_d = {}
for col in df.columns:
    # df[col].values can be a view into df's underlying data, so copy it
    # explicitly; otherwise the shuffle would also reorder df in place.
    tmp_val = df[col].values.copy()
    np.random.shuffle(tmp_val)
    tmp_d[col] = tmp_val

df_shuffled = pd.DataFrame(tmp_d)
# Restore the original column order.
df_shuffled = df_shuffled[df.columns]

This takes approximately 15 seconds.
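For reference, here is a minimal, self-contained timing sketch (not from the original post; the frame size and chunk width are assumptions chosen so it finishes quickly) that compares the three approaches side by side. It uses iloc slices to mirror the np.split chunking:

import time

import numpy as np
import pandas as pd


def shuffle_x(x):
    # Same per-column shuffle as above: copy, then shuffle the copy.
    x = x.copy()
    np.random.shuffle(x)
    return x


df = pd.DataFrame(np.random.rand(30000, 100))  # smaller than the 1000-column case

# 1) Plain apply() over the whole frame.
start = time.perf_counter()
df.apply(shuffle_x, raw=True, axis=0)
print(f"plain apply: {time.perf_counter() - start:.2f}s")

# 2) Chunked apply(): slice into 10-column sub-frames, shuffle, reassemble.
start = time.perf_counter()
chunks = [df.iloc[:, i:i + 10] for i in range(0, df.shape[1], 10)]
pd.concat([c.apply(shuffle_x, raw=True, axis=0) for c in chunks], axis=1)
print(f"chunked:     {time.perf_counter() - start:.2f}s")

# 3) Dict-of-arrays approach.
start = time.perf_counter()
tmp_d = {}
for col in df.columns:
    tmp_val = df[col].values.copy()
    np.random.shuffle(tmp_val)
    tmp_d[col] = tmp_val
pd.DataFrame(tmp_d)[df.columns]
print(f"dict-based:  {time.perf_counter() - start:.2f}s")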

Asked by Eric He, Aug 06 '18



1 Answer

It's faster because it's not doing the same thing.

Fully shuffling a sequence, so that any permutation is possible, requires at least O(n) time. So the bigger your DataFrame, the longer it will take to shuffle.
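For intuition, a full shuffle along the lines of the classic Fisher-Yates algorithm (essentially what np.random.shuffle does) performs one swap per element, which is where the O(n) bound comes from. A tiny sketch:

import random


def fisher_yates(seq):
    # Walk from the end, swapping each position with a random earlier (or
    # equal) position: one swap per element, hence O(n) overall.
    for i in range(len(seq) - 1, 0, -1):
        j = random.randint(0, i)
        seq[i], seq[j] = seq[j], seq[i]
    return seq


print(fisher_yates(list(range(10))))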

Your second example is not equivalent, because it is not fully random: it only shuffles the individual chunks. If there is a column like [1, 2, 3, ..., 29999, 30000], the chunked method will never, for instance, produce a result like [1, 30000, 2, 29999, ...], because it never mixes the beginning of the sequence with the end. There are many possible shuffles that chunk-based shuffling can never reach.
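A toy illustration of that argument (the numbers here are made up, not the poster's data): split a sequence into chunks and shuffle each chunk independently, and no element ever crosses a chunk boundary:

import numpy as np

seq = np.arange(12)
chunks = np.split(seq, 3)      # views: [0..3], [4..7], [8..11]
for chunk in chunks:
    np.random.shuffle(chunk)   # each chunk is shuffled only within itself
print(np.concatenate(chunks))  # e.g. [ 2  0  3  1  6  4  7  5 11  9  8 10]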

In theory, if you split your DataFrame into 100 equal-sized chunks, you would expect each chunk to shuffle about 100 times faster than the whole table. Based on your timings, it looks like the sub-shuffles are actually taking longer than that would predict, which I would guess is at least partly due to the overhead of creating the sub-tables in the first place.
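One way to test that guess (a sketch, with the sizes from the question assumed) is to time the split-and-reassemble machinery on its own, with no shuffling at all; whatever this reports is pure chunking overhead that the one-shot apply() never pays:

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(30000, 1000))

# Slice into 100 ten-column sub-tables and concatenate them straight back.
start = time.perf_counter()
chunks = [df.iloc[:, i:i + 10] for i in range(0, df.shape[1], 10)]
pd.concat(chunks, axis=1)
print(f"split/concat overhead: {time.perf_counter() - start:.2f}s")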

Answered by BrenBarn