Sort Dask DataFrames by multiple columns, some ascending, some descending

Question

I'm converting pandas code to Dask, and the main problem so far is sorting.

For simple sorts I'm using nlargest. For complex sorts, like:

df = df.sort_values(
    by=['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6', 'column_7'],
    ascending=[1, 0, 0, 0, 0, 0, 0]
)

I'm converting to pandas and then back to Dask with dd.from_pandas.

For this: ar = ar.sort_values(by=['column_1', 'column_2'], ascending=[1, 0])

I don't know what to do.

I'm assuming that converting to pandas and back to Dask slows things down (I have no idea how badly).

Can nlargest handle this? I don't see how to achieve one column descending and the other ascending.

Carlos P Ceballos · Accepted Answer

Trying to expand the conversation: maybe it is not about replacing sort_values, but about rewriting the whole thing in a Dask-friendly way.

After:

ar = ar.sort_values(by=['column_1', 'column_2'], ascending=[1, 0])

came:

ar = ar.groupby(['column_1']).first()

These two lines could be rewritten as one Dask-friendly line:

ar = ar.groupby(['column_1']).agg({'column_2': 'max'})
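
To check that the rewrite gives the same result, here is a minimal sketch in plain pandas. The sample data is made up for illustration; only the column names come from the question. It assumes column_2 is the only column you need per group: groupby(...).first() would also return the first value of every other column, which the aggregation does not.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
pdf = pd.DataFrame({
    "column_1": [1, 1, 2, 2, 2],
    "column_2": [3, 7, 5, 1, 4],
})

# Original two-step approach: sort column_1 ascending and column_2
# descending, then keep the first row of each column_1 group.
two_step = (
    pdf.sort_values(by=["column_1", "column_2"], ascending=[True, False])
    .groupby("column_1")
    .first()
)

# Single aggregation: the largest column_2 per group, no global sort needed.
one_step = pdf.groupby("column_1").agg({"column_2": "max"})

# Both approaches produce the same frame for this data.
assert two_step.equals(one_step)
print(one_step)
```

A Dask DataFrame also accepts groupby(...).agg({...}) with a dict like this, so the one-line version should carry over without the round trip through pandas.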

I don't consider this an answer to the question. I'm still looking for ways to deal with sort_values; maybe there are multiple ways.

Tags: sorting, pandas, dataframe, dask
