
Sort Dask DataFrames by multiple columns, some ascending, some descending

I'm converting pandas code to Dask, and the main problem so far is sorting.

For simple sorts I'm using nlargest. For complex sorts, like:

df = df.sort_values(
    by=['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6', 'column_7'],
    ascending=[1, 0, 0, 0, 0, 0, 0]
)

I'm converting to pandas and then back to Dask with dd.from_pandas.

For this: ar = ar.sort_values(by=['column_1', 'column_2'], ascending=[1, 0])

I don't know what to do.

I'm assuming that converting to pandas and back to Dask slows things down (I have no idea how badly).

Can nlargest handle this? I don't see how to sort one column descending and the other ascending.

Carlos P Ceballos asked Nov 01 '25 16:11



1 Answer

Trying to expand the conversation: maybe this is not about replacing sort_values but about rewriting the whole thing in a Dask-friendly way.

In the pandas code, this line:

ar = ar.sort_values(by=['column_1', 'column_2'], ascending=[1, 0])

was followed by:

ar = ar.groupby(['column_1']).first()

These two lines could be rewritten as one Dask-friendly line:

ar = ar.groupby(['column_1']).agg({'column_2': 'max'})
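As a quick sanity check of that rewrite, here is a minimal sketch in plain pandas (Dask is not imported, and the sample data is made up for illustration). It only compares the column_2 values. With more columns the two forms would differ, since first() keeps every column while the agg call keeps only column_2.

```python
import pandas as pd

# Hypothetical sample data, not from the original question.
ar = pd.DataFrame({
    "column_1": ["a", "a", "b", "b", "b"],
    "column_2": [3, 7, 5, 1, 9],
})

# Two-step form: sort column_2 descending within column_1, then keep the first row per group.
two_step = (
    ar.sort_values(by=["column_1", "column_2"], ascending=[True, False])
      .groupby(["column_1"])
      .first()
)

# One-step form: take the maximum of column_2 per group, no sort needed.
one_step = ar.groupby(["column_1"]).agg({"column_2": "max"})

# True when both forms yield the same column_2 per group.
same = two_step["column_2"].equals(one_step["column_2"])
print(same)
```

The one-step form avoids a global sort, which is usually the expensive part on a partitioned DataFrame.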

I don't consider this an answer to the question. I'm still looking for ways to deal with sort_values; maybe there are multiple ways.
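One possible direction, offered only as a sketch: when the descending column is numeric, sorting on a negated helper column lets every key be sorted ascending, so no mixed ascending list is needed. The example below uses plain pandas with made-up data to show that the two orderings agree. The helper name neg_column_2 is hypothetical, and whether sort_values in Dask accepts several columns in by depends on the installed version, so that part is an assumption to verify.

```python
import pandas as pd

# Hypothetical sample data with no duplicate (column_1, column_2) pairs.
ar = pd.DataFrame({
    "column_1": [2, 1, 2, 1, 3],
    "column_2": [10, 40, 30, 20, 50],
})

# Reference result: column_1 ascending, column_2 descending.
mixed = ar.sort_values(by=["column_1", "column_2"], ascending=[True, False])

# Same ordering using only ascending keys: negate the numeric column
# that should be descending, sort, then drop the helper column.
helper = ar.assign(neg_column_2=-ar["column_2"])
all_ascending = (
    helper.sort_values(by=["column_1", "neg_column_2"])
          .drop(columns="neg_column_2")
)

# True when both approaches produce the same row order.
same_order = mixed.reset_index(drop=True).equals(
    all_ascending.reset_index(drop=True)
)
print(same_order)
```

This trick only applies to columns that can be negated (numbers, or dates converted to numbers); string columns would need a different approach.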

Carlos P Ceballos answered Nov 03 '25 08:11



