I am trying to shuffle each column of a pandas DataFrame separately. Here are the functions I wrote:
import numpy as np
import pandas as pd

def shuffle_x(x):
    # shuffle a copy of one column's values so the input array is left untouched
    x = x.copy()
    np.random.shuffle(x)
    return x

def shuffle_table(df):
    # apply the shuffle column by column; raw=True passes each column as a plain ndarray
    df_shuffled = df.apply(shuffle_x, raw=True, axis=0)
    return df_shuffled
Now, I am testing this on a pandas DataFrame df with 30000 rows and 1000 columns. If I call shuffle_table(df) directly, it is really slow and takes more than 1500 seconds. However, if I do something like this:
df_split = np.split(df, 100, axis=1)
df_shuffled = pd.concat([shuffle_table(x) for x in df_split], axis=1)
This is much faster and only takes 60 seconds. My best guess is that this is related to the way pandas allocates space when generating a new DataFrame.
Apart from that, the fastest way I have come up with is:
tmp_d = {}
for col in df.columns:
    # copy the column's values so the original df is not shuffled in place
    tmp_val = df[col].values.copy()
    np.random.shuffle(tmp_val)
    tmp_d[col] = tmp_val
df_shuffled = pd.DataFrame(tmp_d)
df_shuffled = df_shuffled[df.columns]
This takes approximately 15 seconds.
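For reference, the three approaches can be compared with a small self-contained script along the lines of the sketch below. It is only a sketch: it uses an arbitrary, much smaller 3000 x 100 frame so it finishes quickly, slices columns with .iloc instead of np.split, and the helper names (shuffle_by_apply, shuffle_by_chunks, shuffle_by_dict) are just labels for the three approaches above, so the absolute numbers will differ from the ones I reported.

import time

import numpy as np
import pandas as pd

def shuffle_x(x):
    x = x.copy()
    np.random.shuffle(x)
    return x

def shuffle_by_apply(df):
    # slow path: per-column apply on the whole frame
    return df.apply(shuffle_x, raw=True, axis=0)

def shuffle_by_chunks(df, n_chunks=10):
    # middle path: shuffle a few columns at a time, then glue the pieces back together
    width = df.shape[1] // n_chunks
    parts = [df.iloc[:, i:i + width] for i in range(0, df.shape[1], width)]
    return pd.concat([shuffle_by_apply(p) for p in parts], axis=1)

def shuffle_by_dict(df):
    # fast path: shuffle plain NumPy arrays and build the frame once at the end
    tmp_d = {}
    for col in df.columns:
        tmp_val = df[col].values.copy()
        np.random.shuffle(tmp_val)
        tmp_d[col] = tmp_val
    return pd.DataFrame(tmp_d)[df.columns]

df = pd.DataFrame(np.random.randn(3000, 100),
                  columns=[f"c{i}" for i in range(100)])

for name, fn in [("apply", shuffle_by_apply),
                 ("chunked", shuffle_by_chunks),
                 ("dict", shuffle_by_dict)]:
    start = time.perf_counter()
    fn(df)
    print(f"{name}: {time.perf_counter() - start:.3f} s")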
It's faster because it's not doing the same thing.
Fully shuffling a sequence so that it is completely randomized requires at least O(n) time, so the bigger your DataFrame, the longer it will take to shuffle.
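To make that concrete, here is a minimal sketch of a Fisher-Yates shuffle (the algorithm np.random.shuffle is based on); it performs one swap per element, which is where the linear cost comes from. The function name fisher_yates is just for illustration.

import numpy as np

def fisher_yates(a, rng=None):
    # walk the array from the end, swapping each element with a random
    # earlier (or same) position -- one swap per element, hence O(n)
    rng = rng if rng is not None else np.random.default_rng()
    a = a.copy()
    for i in range(len(a) - 1, 0, -1):
        j = rng.integers(0, i + 1)
        a[i], a[j] = a[j], a[i]
    return a

print(fisher_yates(np.arange(10)))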
Your second example is not equivalent, because it is not fully random: it only shuffles the individual chunks. If there is a column like [1, 2, 3, ..., 29999, 30000], your second method will never, for instance, produce a result like [1, 30000, 2, 29999, ...], because it never shuffles the beginning of the sequence together with the end. There are many possible shuffles that chunk-based shuffling simply cannot reach.
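As a toy illustration of that point (assuming the chunking cuts a single sequence into pieces, as described above), shuffling [1..6] as two separate halves can only rearrange values within each half, so 1 can never end up in the second half:

import numpy as np

seq = np.arange(1, 7)              # [1 2 3 4 5 6]
halves = np.split(seq, 2)          # [1 2 3] and [4 5 6]
for half in halves:
    np.random.shuffle(half)        # each half is shuffled only within itself
print(np.concatenate(halves))      # e.g. [2 3 1 6 4 5], never [1 6 2 5 3 4]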
In theory, if you split your DataFrame into 100 equal-sized chunks, you would expect each one to shuffle 100 times faster than the whole. Based on your timings it looks like the sub-shuffles are actually taking longer than that, which I would guess is at least partly due to the overhead of creating the sub-tables in the first place.
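One way to get a rough feel for that overhead is to time just the splitting and re-concatenation, with no shuffling at all. The sketch below assumes the same 30000 x 1000 shape and uses plain column slicing to build the 100 sub-tables:

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(30000, 1000))

start = time.perf_counter()
parts = [df.iloc[:, i:i + 10] for i in range(0, df.shape[1], 10)]   # 100 sub-tables
rebuilt = pd.concat(parts, axis=1)                                  # glue them back together
print(f"split + concat only: {time.perf_counter() - start:.3f} s")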