I have a dataframe like so:
phone_number_1_clean  phone_number_2_clean  phone_number_3_clean
                 NaN                   NaN               8546987
             8316589               8751369                   NaN
             4569874                   NaN               2645981
I would like phone_number_1_clean to be as populated as possible. This requires shifting values from phone_number_2_clean or phone_number_3_clean into phone_number_1_clean, and likewise getting phone_number_2_clean as populated as possible once phone_number_1_clean is filled, and so on.
The output should look something like:
phone_number_1_clean  phone_number_2_clean  phone_number_3_clean
             8546987                   NaN                   NaN
             8316589               8751369                   NaN
             4569874               2645981                   NaN
I might be able to do it with np.where statements, but it could get messy.
The approach should preferably be vectorised, as it will be applied to large-ish DataFrames.
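For reference, the frame can be reproduced like this (a sketch; it assumes the numbers are plain integers, which the NaNs then coerce to float):
import numpy as np
import pandas as pd

#sample data matching the frame above
df = pd.DataFrame({
    'phone_number_1_clean': [np.nan, 8316589, 4569874],
    'phone_number_2_clean': [np.nan, 8751369, np.nan],
    'phone_number_3_clean': [8546987, np.nan, 2645981],
})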
Use:
#for each row drop NaNs and build a new Series - one row of the final df
df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
#the result can have fewer columns than the original df, so reindex to restore them
df1 = df1.reindex(columns=range(len(df.columns)))
#assign the original column names
df1.columns = df.columns
print (df1)
print (df1)
   phone_number_1_clean  phone_number_2_clean  phone_number_3_clean
0               8546987                   NaN                   NaN
1               8316589               8751369                   NaN
2               4569874               2645981                   NaN
Or:
#stack drops NaNs and returns a MultiIndex Series keyed by (row, original column)
s = df.stack()
#replace the column level with a per-row counter, so surviving values are numbered 0, 1, ...
s.index = [s.index.get_level_values(0), s.groupby(level=0).cumcount()]
#unstack left-aligns the values; reindex restores any columns that ended up empty
df1 = s.unstack().reindex(columns=range(len(df.columns)))
df1.columns = df.columns
print (df1)
   phone_number_1_clean  phone_number_2_clean  phone_number_3_clean
0               8546987                   NaN                   NaN
1               8316589               8751369                   NaN
2               4569874               2645981                   NaN
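To see what the stack trick does, here is the intermediate Series for the sample frame (the exact repr assumes the float columns from the reproduction above):
s = df.stack()
print (s)
#0  phone_number_3_clean    8546987.0
#1  phone_number_1_clean    8316589.0
#   phone_number_2_clean    8751369.0
#2  phone_number_1_clean    4569874.0
#   phone_number_3_clean    2645981.0
#dtype: float64
Replacing the column level with the per-row cumcount renumbers these surviving values 0, 1, 2, ..., so unstack writes them back left-aligned.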
Or a slightly modified justify function:
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as invalid (here np.nan)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = pd.notnull(a) #changed to pandas notnull
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
df = pd.DataFrame(justify(df.values, invalid_val=np.nan),
index=df.index, columns=df.columns)
print (df)
   phone_number_1_clean  phone_number_2_clean  phone_number_3_clean
0               8546987                   NaN                   NaN
1               8316589               8751369                   NaN
2               4569874               2645981                   NaN
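The tweak flagged in the comment is the pd.notnull mask: NaN != NaN, so the default a != invalid_val test cannot detect missing values. The output array is also allocated with dtype=object so it can hold both the numbers and NaN.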
Performance:
#3k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [442]: %%timeit
...: df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
...: #the result can have fewer columns than the original df, so reindex to restore them
...: df1 = df1.reindex(columns=range(len(df.columns)))
...: #assign the original column names
...: df1.columns = df.columns
...:
1.17 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [443]: %%timeit
...: s = df.stack()
...: s.index = [s.index.get_level_values(0), s.groupby(level=0).cumcount()]
...:
...: df1 = s.unstack().reindex(columns=range(len(df.columns)))
...: df1.columns = df.columns
...:
...:
5.88 ms ± 74.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [444]: %%timeit
...: pd.DataFrame(justify(df.values, invalid_val=np.nan),
...:              index=df.index, columns=df.columns)
...:
941 µs ± 131 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
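The NumPy-based justify is the fastest because it works on the underlying array in one vectorised pass, while apply is the slowest because it builds a new Series for every row in a Python-level loop.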