Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a better more readable way to coalese columns in pandas

Tags:

python

pandas

I often need a new column that is the best I can achieve from other columns and I have a specific list of preference priorities. I am willing to take the first non null value.

def coalesce(values):
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},                   
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}])

df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second','third','first']].apply(coalesce, axis=1)
print df

Results

  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B

this code works (and the result are what I want) but it is not very fast.
I get to pick my priorities if I need to [['second','third','first']]

Coalesce somewhat like the function of the same name from tsql.
I suspect that I may have overlooked an easy way to achieve it with good performance on large DataFrames (+400,000 rows)

I know there are lots of ways to fill in missing data which I often use on axis=0 this is what makes me think I may have missed an easy option for axis=1

Can you suggest something nicer/faster... or confirm that this is as good as it gets.

like image 298
user2315423 Avatar asked Oct 09 '15 21:10

user2315423


People also ask

Is query faster than LOC pandas?

The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.

How do you coalesce two columns in pandas?

Use the combine_first() Method to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame. The combine_first() method fills null values in one dataframe with non-null data from the second dataframe to combine two dataframe objects.

How can you speed up computations with pandas?

Modin is a new library designed to accelerate Pandas by automatically distributing the computation across all of the system's available CPU cores. With that, Modin claims to be able to get nearly linear speedup to the number of CPU cores on your system for Pandas DataFrames of any size.

What is faster than pandas DataFrame?

Dask runs faster than pandas for this query, even when the most inefficient column type is used, because it parallelizes the computations. pandas only uses 1 CPU core to run the query. My computer has 4 cores and Dask uses all the cores to run the computation.


1 Answers

The Pandas equivalent to COALESCE is the method fillna():

result = column_a.fillna(column_b)

The result is a column where each value is taken from column_a if that column provides a non-null value, otherwise the value is taken from column_b. So your combo1 can be produced with:

df['first'].fillna(df['second']).fillna(df['third'])

giving:

0       A
1       C
2       B
3    None
4       A

And your combo2 can be produced with:

(df['second']).fillna(df['third']).fillna(df['first'])

which returns the new column:

0       C
1       C
2       B
3    None
4       B

If you wanted an efficient operation called coalesce, it could simply combine columns with fillna() from left to right and then return the result:

def coalesce(df, column_names):
    i = iter(column_names)
    column_name = next(i)
    answer = df[column_name]
    for column_name in i:
        answer = answer.fillna(df[column_name])
    return answer

print coalesce(df, ['first', 'second', 'third'])
print coalesce(df, ['second', 'third', 'first'])

which gives:

0       A
1       C
2       B
3    None
4       A

0       C
1       C
2       B
3    None
4       B
like image 183
Brandon Rhodes Avatar answered Sep 21 '22 01:09

Brandon Rhodes