I am looking for an equivalent of the pandas.DataFrame.sort_values function in Dask.
I came across set_index, but it only sorts on a single column.
How can I sort a Dask DataFrame by multiple columns?
So far Dask does not seem to support sorting by multiple columns. However, creating a new column that concatenates the values of the columns you want to sort by, and then sorting on that column, may be a usable workaround.
d['new_column'] = d.apply(lambda r: str([r.col1,r.col2]), axis=1)
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())
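A caveat about the composite string key: strings compare character by character, so the result matches a true multi-column sort only when the underlying values are themselves strings. The following is a minimal plain-Python sketch of that behaviour (no Dask involved; the sample rows and variable names are made up for illustration):

```python
# Composite string keys compare character by character, not numerically.
rows = [(2, "b"), (10, "a"), (2, "a")]

# True multi-column order: tuples compare element by element.
by_tuple = sorted(rows)
# Order produced by a key built the same way as new_column above.
by_str_key = sorted(rows, key=lambda r: str(list(r)))

print(by_tuple)    # numeric order on the first element
print(by_str_key)  # "[10, ..." sorts before "[2, ..." because '1' < '2'

# With string values the two orders agree for this sample.
str_rows = [("b", "x"), ("a", "z"), ("a", "y")]
str_by_tuple = sorted(str_rows)
str_by_key = sorted(str_rows, key=lambda r: str(list(r)))
print(str_by_tuple == str_by_key)
```

So the string key is reasonable for string columns, but numeric or datetime columns need a different encoding.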
Edit:
The above works if you want to sort by two strings. Otherwise, I recommend creating integer (or bytes) columns and then using struct.pack to create a new composite bytes column. For example, if col1_dt is a datetime and col2 is an integer:
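import struct
import numpy as np
# Convert the datetime to whole seconds elapsed since the earliest value.
# (Second resolution is assumed to be sufficient for this data.)
d['col1_int'] = ((d['col1_dt'] -
                  d['col1_dt'].min())/np.timedelta64(1,'s')
                 ).astype(int)
# ">qq" packs both values big-endian so the bytes compare in numeric order;
# this assumes both values are non-negative.
d['new_column'] = d.apply(lambda r: struct.pack(">qq",r.col1_int,r.col2), axis=1)
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())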
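One detail worth checking with a packed key is byte order. Bytes compare from left to right, so the most significant byte has to come first (big-endian) for the byte order to match the numeric order. Here is a small standard-library sketch, assuming non-negative integers; the sample pairs and helper names are hypothetical:

```python
import struct

# Sample (col1_int, col2) pairs; all values are non-negative.
pairs = [(1, 5), (256, 0), (2, 300), (1, 4)]

def big_endian_key(pair):
    # ">QQ": two unsigned 64-bit integers, most significant byte first.
    return struct.pack(">QQ", *pair)

def little_endian_key(pair):
    # "<QQ": same values, least significant byte first.
    return struct.pack("<QQ", *pair)

expected = sorted(pairs)                             # true two-column order
big_sorted = sorted(pairs, key=big_endian_key)       # matches expected
little_sorted = sorted(pairs, key=little_endian_key) # puts (256, 0) first

print(big_sorted == expected)
print(little_sorted == expected)
```

A format string without a prefix, such as "ll", uses the machine's native byte order, which is little-endian on most common hardware. Negative values would also need extra handling (for example, adding an offset so every value is non-negative) before the packed bytes sort correctly.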