 

Pandas: take whichever column is not NaN

Tags:

python

pandas

I am working with a fairly messy data set that has been split across individual csv files with slightly different column names. It would be too onerous to rename the columns in the csv files, partly because I am still discovering all the variations, so instead I want to determine, for a set of columns in a given row, which field is not NaN, and carry that value forward to a new column. Is there a way to do that?

Case in point. Let's say I have a data frame that looks like this:

Index   A     B
1       15    NaN
2       NaN   11
3       NaN   99
4       NaN   NaN
5       12    14

Let's say my desired output from this is to create a new column C such that my data frame will look like the following:

Index   A     B       C
1       15    NaN     15
2       NaN   11      11
3       NaN   99      99
4       NaN   NaN     NaN
5       12    14      12 (so giving priority to A over B)

How can I accomplish this?

asked Aug 16 '16 by helloB

People also ask

How do I select a column without NaN?

df[df.columns[~df.isnull().any()]] will give you a DataFrame containing only the columns that have no null values.

Is not NaN in pandas?

isna() in the pandas library can be used to check whether a value is null/NaN. It returns True for values that are NaN/null.

How do you get NOT null columns in pandas?

Pandas DataFrame notnull() method: the notnull() method returns a DataFrame object where every value is replaced with a Boolean, True for NOT NULL values and False otherwise.
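The three answers above can be checked with a small sketch (the column names here are made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [15, np.nan], 'B': [1, 2]})

# Keep only the columns with no null values at all
clean = df[df.columns[~df.isnull().any()]]   # only column 'B' survives

# Element-wise null checks: notnull() is the inverse of isna()
mask = df['A'].notnull()                     # True, False
```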


2 Answers

For a dataframe with an arbitrary number of columns, you can backfill along the rows (.bfill(axis=1)) and take the first column (.iloc[:, 0]):

df = pd.DataFrame({
    'A': [15, None, None, None, 12],
    'B': [None, 11, 99, None, 14],
    'C': [10, None, 10, 10, 10]})

df['D'] = df.bfill(axis=1).iloc[:, 0]

>>> df
    A   B   C   D
0  15 NaN  10  15
1 NaN  11 NaN  11
2 NaN  99  10  99
3 NaN NaN  10  10
4  12  14  10  12
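The same left-to-right precedence can also be made explicit with combine_first, which fills NaNs in the caller from its argument; chaining gives A priority over B, then C (a sketch of an equivalent alternative, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [15, None, None, None, 12],
    'B': [None, 11, 99, None, 14],
    'C': [10, None, 10, 10, 10]})

# For each row: take A if present, else B, else C
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
# df['D'] -> 15, 11, 99, 10, 12
```

Unlike bfill, this does not depend on the physical column order in the frame, so you can state any precedence you like.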
answered Sep 20 '22 by Alexander

If you just have 2 columns, the cleanest way is to use where. Note the syntax: s.where(cond, other) keeps the caller's values where cond is True and takes other where it is False (for some reason it took me a while to wrap my head around this).

In [2]: df.A.where(df.A.notnull(),df.B)
Out[2]:
0    15.0
1    11.0
2    99.0
3     NaN
4    12.0
Name: A, dtype: float64
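For the two-column case, fillna is an equivalent and arguably more readable spelling of the same idea (a sketch, reusing the question's A/B frame):

```python
import pandas as pd

df = pd.DataFrame({'A': [15, None, None, None, 12],
                   'B': [None, 11, 99, None, 14]})

# Take A where present, otherwise fall back to B
df['C'] = df['A'].fillna(df['B'])
# df['C'] -> 15, 11, 99, NaN, 12
```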

If you have more than two columns, it might be simpler to use max or min; these ignore null values, but you lose the column precedence you want (note row 4 below takes 14 from B, not 12 from A):

In [3]: df.max(axis=1)
Out[3]:
0    15.0
1    11.0
2    99.0
3     NaN
4    14.0
dtype: float64
answered Sep 20 '22 by maxymoo