Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas DataFrame: copy the contents of a column if it is empty

I have the following DataFrame with named columns and index:

  'a'     'a*'    'b'    'b*'
1  5      NaN     9      NaN
2  NaN    3       3      NaN
3  4      NaN     1      NaN
4  NaN    9       NaN    7

The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.

I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.

Is there an efficient way to do such an operation?

like image 588
hamslice Avatar asked Nov 20 '18 09:11

hamslice


People also ask

How do I get the contents of a column in pandas?

You can use the loc and iloc functions to access columns in a Pandas DataFrame. Let's see how. If we wanted to access a certain column in our DataFrame, for example the Grades column, we could simply use the loc function and specify the name of the column in order to retrieve it.

Why do we use copy () in pandas?

The copy() method returns a copy of the DataFrame. By default, the copy is a "deep copy" meaning that any changes made in the original DataFrame will NOT be reflected in the copy.

How do I check if a pandas column is empty?

You can use the attribute df. empty to check whether it's empty or not. True if DataFrame is entirely empty (no items), meaning any of the axes are of length 0.


3 Answers

Use np.where

df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])

Output:

     a  a*  b   b*
0   5.0 NaN 9.0 NaN
1   3.0 3.0 3.0 NaN
2   4.0 NaN 1.0 NaN
3   9.0 9.0 7.0 7.0
like image 130
Sociopath Avatar answered Oct 09 '22 15:10

Sociopath


Using fillna() is a lot slower than np.where but has the advantage of being pandas only. If you want a faster method and keep it pandas pure, you can use combine_first() which according to the documentation is used to:

Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes

Translation: this is a method designed to do exactly what is asked in the question.

How do I use it?

df['a'].combine_first(df['a*'])

Performance:

df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})

def using_fillna(df):
    return df['A'].fillna(df['A*'])

def using_combine_first(df):
    return df['A'].combine_first(df['A*'])

def using_np_where(df):
    return np.where(df['A'].isnull(), df['A*'], df['A'])

def using_np_where_numpy(df):
    return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

%timeit -n 100 using_fillna(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_np_where_numpy(df)

1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
like image 22
user3471881 Avatar answered Oct 09 '22 15:10

user3471881


For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:

df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)

Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:

First create MultiIndex by split with append *val:

df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)

And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:

df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
print (df)
     a    b
1  5.0  9.0
2  3.0  3.0
3  4.0  1.0
4  9.0  7.0

Performance: (depends of number of missing values and length of DataFrame)

df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 
                   'A*': [4, 4, 5, 6, 7, 8] * 10000})

def using_fillna(df):
    df['A'] = df['A'].fillna(df['A*'])
    return df

def using_np_where(df):
    df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
    return df

def using_np_where_numpy(df):
    df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
    return df

def using_combine_first(df):
    df['D'] = df['A'].combine_first(df['A*'])
    return df

%timeit -n 100 using_fillna(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where_numpy(df)

1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
like image 34
jezrael Avatar answered Oct 09 '22 16:10

jezrael