I'm converting pandas code to Dask, and the main problem so far is sorting.
For simple sorts I'm using nlargest. For complex sorts, like:
df = df.sort_values(
    by=['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6', 'column_7'],
    ascending=[1, 0, 0, 0, 0, 0, 0]
)
I'm converting to pandas and then back to Dask with dd.from_pandas.
For this one:
ar = ar.sort_values(by=['column_1', 'column_2'], ascending=[1, 0])
I don't know what to do.
I assume converting to pandas and back to Dask slows things down, though I have no idea how badly.
Can nlargest handle this? I don't see how to sort one column descending and the other ascending.
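One possible workaround for the mixed-direction case, sketched here in plain pandas: negate the numeric column that should be descending and call nsmallest on both columns. The helper column name neg_column_2 and the function top_n_mixed_order are made up for illustration, and this only works for numeric columns. Dask's DataFrame also has assign and nsmallest, so the same idea may carry over, but I have not verified how it behaves on large partitioned data.

```python
import pandas as pd


def top_n_mixed_order(df, n):
    """Return the first n rows ordered by column_1 ascending, column_2 descending."""
    # Hypothetical helper column: negating column_2 turns "descending" into "ascending".
    helper = df.assign(neg_column_2=-df["column_2"])
    # nsmallest with several columns uses neg_column_2 to break ties in column_1.
    result = helper.nsmallest(n, ["column_1", "neg_column_2"])
    return result.drop(columns="neg_column_2")


df = pd.DataFrame({"column_1": [2, 1, 2, 1, 3], "column_2": [10, 5, 30, 7, 1]})
top3 = top_n_mixed_order(df, 3)

# Reference result from the original sort_values call, for comparison.
expected = df.sort_values(by=["column_1", "column_2"], ascending=[True, False]).head(3)
print(top3)
print(expected)
```

Note that this returns only the top n rows, so it replaces a full sort only when n rows are all that is needed afterwards.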
To expand the conversation: maybe this is not about replacing sort_values but about rewriting the whole thing in a Dask-friendly way.
After:
ar = ar.sort_values(by=['column_1', 'column_2'], ascending=[1, 0])
came:
ar = ar.groupby(['column_1']).first()
These two lines could be rewritten as one Dask-friendly line:
ar = ar.groupby(['column_1']).agg({'column_2': 'max'})
I don't consider this an answer to the question. I'm still looking for ways to deal with sort_values; maybe there are multiple ways.
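A quick sanity check of the sort-then-first versus max rewrite, done in plain pandas on a small made-up frame. It assumes that only column_2 is needed from each group and that column_2 has no missing values; with more columns, first() would also return the other columns of the top row, which agg with 'max' on column_2 alone does not.

```python
import pandas as pd

# Small illustrative frame; the values are invented for the check.
ar = pd.DataFrame({"column_1": ["a", "b", "a", "b"], "column_2": [3, 8, 5, 2]})

# Original two steps: sort column_2 descending within column_1,
# then keep the first row of each group.
via_sort = (
    ar.sort_values(by=["column_1", "column_2"], ascending=[True, False])
    .groupby(["column_1"])
    .first()
)

# Single aggregation: the first row of a descending sort is the group maximum.
via_agg = ar.groupby(["column_1"]).agg({"column_2": "max"})

print(via_sort.equals(via_agg))
```

If the two frames compare equal on real data too, the aggregation version avoids the global sort entirely.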