
Pandas: shifting columns depending on if NaN or not

Tags:

python

pandas

I have a dataframe like so:

phone_number_1_clean    phone_number_2_clean    phone_number_3_clean
                 NaN                     NaN                 8546987
             8316589                 8751369                     NaN
             4569874                     NaN                 2645981

I would like phone_number_1_clean to be as populated as possible. This will require shifting values from phone_number_2_clean or phone_number_3_clean into phone_number_1_clean, then likewise filling phone_number_2_clean from phone_number_3_clean where possible, and so on.

The output should look something like:

phone_number_1_clean    phone_number_2_clean    phone_number_3_clean
             8546987                     NaN                     NaN
             8316589                 8751369                     NaN
             4569874                 2645981                     NaN

I might be able to do it with np.where statements, but it could get messy.

The approach would preferably be vectorised, as it will be applied to large-ish dataframes.
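
For reference, a minimal sketch that rebuilds the example frame (the float dtypes here are an assumption, since the real column types are not shown):

import numpy as np
import pandas as pd

#reconstruction of the sample data; the NaNs force the columns to float dtype
df = pd.DataFrame({
    'phone_number_1_clean': [np.nan, 8316589, 4569874],
    'phone_number_2_clean': [np.nan, 8751369, np.nan],
    'phone_number_3_clean': [8546987, np.nan, 2645981],
})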

Auren Ferguson asked Jul 12 '18 11:07



1 Answer

Use:

#for each row, drop the NaNs and build a new Series - these become the rows of the final df
df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
#the result can have fewer columns than the original df (if no row is fully populated), so reindex back to the full width
df1 = df1.reindex(columns=range(len(df.columns)))
#assign the original column names
df1.columns = df.columns
print (df1)
  phone_number_1_clean phone_number_2_clean  phone_number_3_clean
0              8546987                  NaN                   NaN
1              8316589              8751369                   NaN
2              4569874              2645981                   NaN

Or:

s = df.stack()
s.index = [s.index.get_level_values(0), s.groupby(level=0).cumcount()]

df1 = s.unstack().reindex(columns=range(len(df.columns)))
df1.columns = df.columns
print (df1)
  phone_number_1_clean phone_number_2_clean  phone_number_3_clean
0              8546987                  NaN                   NaN
1              8316589              8751369                   NaN
2              4569874              2645981                   NaN
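
To see why this works: stack drops the NaNs by default, and the cumcount over level 0 renumbers the surviving values 0, 1, 2, ... within each row, so the later unstack places them left-justified instead of back into their original columns. A small sketch of the intermediate Series, assuming the sample frame built above:

s = df.stack()           #NaNs are dropped, only the real numbers survive
print (s)
#0  phone_number_3_clean    8546987.0
#1  phone_number_1_clean    8316589.0
#   phone_number_2_clean    8751369.0
#2  phone_number_1_clean    4569874.0
#   phone_number_3_clean    2645981.0
#dtype: float64

#swap the column-name level for a per-row counter before unstacking
s.index = [s.index.get_level_values(0), s.groupby(level=0).cumcount()]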

Or a slightly modified justify function:

def justify(a, invalid_val=0, axis=1, side='left'):    
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value to treat as missing (e.g. np.nan)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = pd.notnull(a) #changed to pandas notnull
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

df = pd.DataFrame(justify(df.values, invalid_val=np.nan),  
                  index=df.index, columns=df.columns)
print (df)
  phone_number_1_clean phone_number_2_clean phone_number_3_clean
0              8546987                  NaN                  NaN
1              8316589              8751369                  NaN
2              4569874              2645981                  NaN
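
Note that justify fills its output array with dtype=object, so the rebuilt DataFrame holds object columns. If numeric columns are wanted afterwards, one possible follow-up (an assumption, not part of the original answer) is:

#convert each column back to numeric; the remaining NaNs keep them as float64
df = df.apply(pd.to_numeric)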

Performance:

#3k rows
df = pd.concat([df] * 1000, ignore_index=True)

In [442]: %%timeit
     ...: df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
     ...: #reindex back to the full width in case the result has fewer columns than the original df
     ...: df1 = df1.reindex(columns=range(len(df.columns)))
     ...: #assign the original column names
     ...: df1.columns = df.columns
     ...: 
1.17 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [443]: %%timeit
     ...: s = df.stack()
     ...: s.index = [s.index.get_level_values(0), s.groupby(level=0).cumcount()]
     ...: 
     ...: df1 = s.unstack().reindex(columns=range(len(df.columns)))
     ...: df1.columns = df.columns
     ...: 
     ...: 
5.88 ms ± 74.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [444]: %%timeit
     ...: pd.DataFrame(justify(df.values, invalid_val=np.nan),
     ...:              index=df.index, columns=df.columns)
     ...: 
941 µs ± 131 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

jezrael answered Sep 29 '22 02:09