We have a very large CSV file that has been imported as a dask dataframe. Here is a small example to explain the question.
import dask.dataframe as dd
df = dd.read_csv("name and path of the file.csv")
df.head()
Output:
col1 | col2 | col3 | col4
22 | NaN | 23 | 56
12 | 54 | 22 | 36
48 | NaN | 2 | 45
76 | 32 | 13 | 6
23 | NaN | 43 | 8
67 | 54 | 56 | 64
16 | 32 | 32 | 6
3 | 54 | 64 | 8
67 | NaN | 23 | 64
I want to replace the value of col4 with col1 if col4 < col1 and col2 is not NaN.
So the result should be:
col1 | col2 | col3 | col4
22 | NaN | 23 | 56
12 | 54 | 22 | 36
48 | NaN | 2 | 45
76 | 32 | 13 | 76
23 | NaN | 43 | 8
67 | 54 | 56 | 67
16 | 32 | 32 | 16
3 | 54 | 64 | 8
67 | NaN | 23 | 64
I know how to do it in pandas:
condition= df[(df['col4'] < df['col1']) & (pd.notnull(df['col2']))].index
df.loc[condition,'col4'] = df.loc[condition, 'col1'].values
In pandas, you can replace values in all or selected columns based on a condition by using the DataFrame.loc[] property. loc[] accesses a group of rows and columns by label(s) or a boolean array, and it can be used both to read and to assign values.
The npartitions property is the number of pandas dataframes that compose a single dask dataframe. This affects performance in two main ways. If you don't have enough partitions, then you may not be able to use all of your cores effectively. For example if your dask.
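I think you need a boolean mask built directly from the columns (no .index needed). Using the notnull() method instead of pd.notnull keeps the condition lazy on a dask dataframe:
condition = (df['col4'] < df['col1']) & (df['col2'].notnull())
In pandas, the mask can then be passed to loc:
df.loc[condition,'col4'] = df.loc[condition, 'col1']
As far as I know, loc does not support item assignment on a dask dataframe, so there use dask.dataframe.Series.mask instead: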
df['col4'] = df['col4'].mask(condition, df['col1'])
print(df.compute())  # compute() materialises the lazy dask dataframe
col1 col2 col3 col4
0 22 NaN 23 56
1 12 54.0 22 36
2 48 NaN 2 45
3 76 32.0 13 76
4 23 NaN 43 8
5 67 54.0 56 67
6 16 32.0 32 16
7 3 54.0 64 8
8 67 NaN 23 64
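For anyone who wants to check the mask approach against the sample data from the question, here is a minimal sketch. It runs on a plain pandas dataframe so that it needs nothing beyond pandas and numpy; the helper name raise_col4_to_col1 is made up for illustration and is not part of either library. It only uses notnull(), mask() and assign(), which both pandas and dask expose, so the same function should also accept a dask dataframe.

```python
import numpy as np
import pandas as pd


def raise_col4_to_col1(df):
    """Return a copy where col4 is set to col1 wherever
    col4 < col1 and col2 is not missing (hypothetical helper)."""
    condition = (df["col4"] < df["col1"]) & df["col2"].notnull()
    # mask() replaces the values where the condition is True
    # and keeps the original value everywhere else.
    return df.assign(col4=df["col4"].mask(condition, df["col1"]))


# Sample data copied from the question.
pdf = pd.DataFrame({
    "col1": [22, 12, 48, 76, 23, 67, 16, 3, 67],
    "col2": [np.nan, 54, np.nan, 32, np.nan, 54, 32, 54, np.nan],
    "col3": [23, 22, 2, 13, 43, 56, 32, 64, 23],
    "col4": [56, 36, 45, 6, 8, 64, 6, 8, 64],
})

result = raise_col4_to_col1(pdf)
print(result)
```

To try it with dask, wrap the sample with dd.from_pandas(pdf, npartitions=2) before calling the function; the result is then lazy, so call .compute() on it to see the values.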