I have the following DataFrame with named columns and index:
'a' 'a*' 'b' 'b*'
1 5 NaN 9 NaN
2 NaN 3 3 NaN
3 4 NaN 1 NaN
4 NaN 9 NaN 7
The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.
I want to copy any values (which are not null) from a*
and b*
columns to a
and b
, respectively.
Is there an efficient way to do such an operation?
You can use the loc and iloc functions to access columns in a Pandas DataFrame. Let's see how. If we wanted to access a certain column in our DataFrame, for example the Grades column, we could simply use the loc function and specify the name of the column in order to retrieve it.
The copy() method returns a copy of the DataFrame. By default, the copy is a "deep copy" meaning that any changes made in the original DataFrame will NOT be reflected in the copy.
You can use the attribute df. empty to check whether it's empty or not. True if DataFrame is entirely empty (no items), meaning any of the axes are of length 0.
Use np.where
df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])
Output:
a a* b b*
0 5.0 NaN 9.0 NaN
1 3.0 3.0 3.0 NaN
2 4.0 NaN 1.0 NaN
3 9.0 9.0 7.0 7.0
Using fillna()
is a lot slower than np.where
but has the advantage of being pandas
only. If you want a faster method and keep it pandas
pure, you can use combine_first()
which according to the documentation is used to:
Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Translation: this is a method designed to do exactly what is asked in the question.
How do I use it?
df['a'].combine_first(df['a*'])
Performance:
df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
return df['A'].fillna(df['A*'])
def using_combine_first(df):
return df['A'].combine_first(df['A*'])
def using_np_where(df):
return np.where(df['A'].isnull(), df['A*'], df['A'])
def using_np_where_numpy(df):
return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_np_where_numpy(df)
1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For better performance is possible use numpy.isnan
and convert Series to numpy arrays by values
:
df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)
Another general solution if exist only pairs with/without *
in columns of DataFrame and is necessary remove *
columns:
First create MultiIndex
by split
with append *val
:
df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)
And then select by DataFrame.xs
for DataFrames, so DataFrame.fillna
working very nice:
df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
print (df)
a b
1 5.0 9.0
2 3.0 3.0
3 4.0 1.0
4 9.0 7.0
Performance: (depends of number of missing values and length of DataFrame)
df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000,
'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
df['A'] = df['A'].fillna(df['A*'])
return df
def using_np_where(df):
df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
return df
def using_np_where_numpy(df):
df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
return df
def using_combine_first(df):
df['D'] = df['A'].combine_first(df['A*'])
return df
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where_numpy(df)
1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With