I have a pandas DataFrame:
>>> df_freq = pd.DataFrame([["Z11", "Z11", "X11"], ["Y11","",""], ["Z11","Z11",""]], columns=list('ABC'))
>>> df_freq
    A   B   C
0   Z11 Z11 X11
1   Y11     
2   Z11 Z11 
I want to make sure each row contains only unique values, so it should become the following (removed values can be replaced with zero or left empty):
    A   B   C
0   Z11 0   X11
1   Y11     
2   Z11 0   
My DataFrame is big, with hundreds of columns and thousands of rows. The goal is to count the unique values in the DataFrame, which I do by converting it to a matrix and applying
>>> np.unique(mat.astype(str), return_counts=True)
But in certain rows the same value occurs more than once, and I want to remove those repeats before applying np.unique(), keeping only the unique values in each row.
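For reference, here is a minimal reproduction of the problem: counting on the raw frame counts 'Z11' once per cell rather than once per row.

```python
import numpy as np
import pandas as pd

df_freq = pd.DataFrame([["Z11", "Z11", "X11"],
                        ["Y11", "", ""],
                        ["Z11", "Z11", ""]], columns=list('ABC'))

mat = df_freq.values
vals, counts = np.unique(mat.astype(str), return_counts=True)
# 'Z11' is counted 4 times here, even though it appears in only 2 rows
print(dict(zip(vals.tolist(), counts.tolist())))
```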
Use a combination of astype(bool) and duplicated:
mask = df_freq.apply(pd.Series.duplicated, axis=1) & df_freq.astype(bool)
df_freq.mask(mask, 0)
     A  B    C
0  Z11  0  X11
1  Y11        
2  Z11  0     
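To connect this back to the counting goal in the question, a quick sketch (assuming 0 as the placeholder, as above):

```python
import numpy as np
import pandas as pd

df_freq = pd.DataFrame([["Z11", "Z11", "X11"],
                        ["Y11", "", ""],
                        ["Z11", "Z11", ""]], columns=list('ABC'))

# flag within-row repeats; astype(bool) leaves empty strings (falsy) alone
mask = df_freq.apply(pd.Series.duplicated, axis=1) & df_freq.astype(bool)
cleaned = df_freq.mask(mask, 0)

# each value is now counted once per row it appears in
vals, counts = np.unique(cleaned.values.astype(str), return_counts=True)
```

Note that after astype(str) the placeholder shows up as the string '0', so pick a placeholder that cannot collide with your real values.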
Here's a vectorized NumPy approach -
def reset_rowwise_dups(df):
    n = df.shape[0]
    row_idx = np.arange(n)[:,None]
    a = df.values
    idx = np.argsort(a,1)
    sorted_a = a[row_idx, idx]
    idx_reversed = idx.argsort(1)
    # adjacent equal entries in each sorted row are duplicates
    sorted_a_dupmask = sorted_a[:,1:] == sorted_a[:,:-1]
    dup_mask = np.column_stack((np.zeros(n,dtype=bool), sorted_a_dupmask))
    # map flags back to the original column order; leave empty cells alone
    final_mask = dup_mask[row_idx, idx_reversed] & (a != '')
    a[final_mask] = 0
Sample run -
In [80]: df_freq
Out[80]: 
     A    B    C    D
0  Z11  Z11  X11  Z11
1  Y11            Y11
2  Z11  Z11       X11
In [81]: reset_rowwise_dups(df_freq)
In [82]: df_freq
Out[82]: 
     A  B    C    D
0  Z11  0  X11    0
1  Y11            0
2  Z11  0       X11
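To see why the argsort-of-argsort step works, here's the trick spelled out on a single row (a standalone sketch, not part of the function above):

```python
import numpy as np

a = np.array([["Z11", "Z11", "X11", "Z11"]])
row_idx = np.arange(a.shape[0])[:, None]

idx = np.argsort(a, 1)          # per-row order that sorts the values
sorted_a = a[row_idx, idx]      # [['X11' 'Z11' 'Z11' 'Z11']]

# after sorting, duplicates sit next to each other
dup = np.column_stack((np.zeros(a.shape[0], dtype=bool),
                       sorted_a[:, 1:] == sorted_a[:, :-1]))

# argsort of the argsort is the inverse permutation: it maps the
# sorted-order duplicate flags back to the original column positions
back = dup[row_idx, idx.argsort(1)]
print(back)  # [[False  True False  True]] -> 2nd and 4th 'Z11' flagged
```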
Runtime test
# Proposed earlier in this post
def reset_rowwise_dups(df):
    n = df.shape[0]
    row_idx = np.arange(n)[:,None]
    a = df.values
    idx = np.argsort(a,1)
    sorted_a = a[row_idx, idx]
    idx_reversed = idx.argsort(1)
    sorted_a_dupmask = sorted_a[:,1:] == sorted_a[:,:-1]
    dup_mask = np.column_stack((np.zeros(n,dtype=bool), sorted_a_dupmask))
    final_mask = dup_mask[row_idx, idx_reversed] & (a != '' )
    a[final_mask] = 0
# @piRSquared's soln using pandas apply
def apply_based(df):
    mask = df.apply(pd.Series.duplicated, axis=1) & df.astype(bool)
    return df.mask(mask, 0)
Timings -
In [151]: df_freq = pd.DataFrame([["Z11", "Z11", "X11", "Z11"], \
     ...:  ["Y11","","", "Y11"],["Z11","Z11","","X11"]], columns=list('ABCD'))
In [152]: df_freq
Out[152]: 
     A    B    C    D
0  Z11  Z11  X11  Z11
1  Y11            Y11
2  Z11  Z11       X11
In [153]: df = pd.concat([df_freq]*10000,axis=0)
In [154]: df.index = range(df.shape[0])
In [155]: %timeit apply_based(df)
1 loops, best of 3: 3.35 s per loop
In [156]: %timeit reset_rowwise_dups(df)
100 loops, best of 3: 12.7 ms per loop