I am trying to replace duplicate values across multiple columns in a pandas df. For the df below I have dates and values respective to these dates. All values are the same for each date. I just want to keep the first value for each date and replace the following duplicate values with np.nan. Below is my attempt:
import pandas as pd
import numpy as np
d = ({
'Date' : ['1/1/18','1/1/18','1/1/18','2/1/18','2/1/18','3/2/18','3/2/18','3/2/18'],
'Val_D' : [10,10,10,22,22,10,10,10],
'Val_M' : [100,100,100,100,100,240,240,240],
})
df = pd.DataFrame(data = d)
df['Date'] = pd.to_datetime(df['Date'], format= '%d/%m/%y')
pd.Series([10,22,100,240]).duplicated()
dup = df.apply(pd.Series.duplicated, axis = 1)
df = df.where(~dup,np.nan)
print(df)
Intended Output:
Date Val_D Val_M
0 1/1/18 10 100
1 1/1/18
2 1/1/18
3 2/1/18 22
4 2/1/18
5 3/2/18 10 240
6 3/2/18
7 3/2/18
Well, one way is to simply use diff+ne
s = df[['Val_D', 'Val_M']]
df[['Val_D', 'Val_M']] = s[s.diff().ne(0)].fillna('')
Even though this yields your expected output, the dtypes of your columns become object and you lose your vectorization power with numbers. So I'd suggest you don't do that. Without the last piece fillna(''), you get
Date Val_D Val_M
0 2018-01-01 10.0 100.0
1 2018-01-01 NaN NaN
2 2018-01-01 NaN NaN
3 2018-01-02 22.0 NaN
4 2018-01-02 NaN NaN
5 2018-02-03 10.0 240.0
6 2018-02-03 NaN NaN
7 2018-02-03 NaN NaN
The dtypes are float. Now, with the fillna('') piece, you get
Date Val_D Val_M
0 2018-01-01 10 100
1 2018-01-01
2 2018-01-01
3 2018-01-02 22
4 2018-01-02
5 2018-02-03 10 240
6 2018-02-03
7 2018-02-03
with object dtypes.
I am using where with diff
df[['Val_D', 'Val_M']].where(df[['Val_D', 'Val_M']].diff().ne(0))
Out[1136]:
Val_D Val_M
0 10.0 100.0
1 NaN NaN
2 NaN NaN
3 22.0 NaN
4 NaN NaN
5 10.0 240.0
6 NaN NaN
7 NaN NaN
df[['Val_D', 'Val_M']]= df[['Val_D', 'Val_M']].where(df[['Val_D', 'Val_M']].diff().ne(0))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With