Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas force NaN to bottom of each column at each index

Tags:

python

pandas

I have a DataFrame where multiple rows span each index. Taking the first index, for example, has such a structure:

df = pd.DataFrame([["A", "first", 1.0, 1.0, np.NaN],
            [np.NaN, np.NaN, 2.0, np.NaN, 2.0],
            [np.NaN, np.NaN, np.NaN, 3.0, 3.0]],
            columns=["ID", "Name", "val1", "val2", "val3"],
            index=[0, 0, 0])

Out[4]:
    ID   Name  val1  val2  val3
0    A  first     1     1   NaN
0  NaN    NaN     2   NaN     2
0  NaN    NaN   NaN     3     3

I would like to sort/order each column such that the NaNs are at the bottom of each column at that given index - a result which looks like this:

    ID   Name  val1  val2  val3
0    A  first     1     1     2
0  NaN    NaN     2     3     3
0  NaN    NaN   NaN     NaN   NaN

A more explicit example might look like this:

df = pd.DataFrame([["A", "first", 1.0, 1.0, np.NaN],
            [np.NaN, np.NaN, 2.0, np.NaN, 2.0],
            [np.NaN, np.NaN, np.NaN, 3.0, 3.0],
            ["B", "second", 4.0, 4.0, np.NaN],
            [np.NaN, np.NaN, 5.0, np.NaN, 5.0],
            [np.NaN, np.NaN, np.NaN, 6.0, 6.0]],
            columns=[ "ID", "Name", "val1", "val2", "val3"],
            index=[0, 0, 0, 1, 1, 1])

Out[5]:
    ID    Name  val1  val2  val3
0    A   first     1     1   NaN
0  NaN     NaN     2   NaN     2
0  NaN     NaN   NaN     3     3
1    B  second     4     4   NaN
1  NaN     NaN     5   NaN     5
1  NaN     NaN   NaN     6     6

with the desired result to look like this:

    ID    Name  val1  val2  val3
0    A   first     1     1     2
0  NaN     NaN     2     3     3
0  NaN     NaN   NaN   NaN   NaN
1    B  second     4     4     5
1  NaN     NaN     5     6     6
1  NaN     NaN   NaN   NaN   NaN

I have many thousands of rows in this dataframe, with each index containing up to a few hundred rows. My desired result will be very helpful when I to_csv the dataframe.

I have attempted to use sort_values(['val1','val2','val3']) on the whole data frame, but this results in the indices becoming disordered. I have tried to iterate through each index and sort in place, but this too does not restrict the NaN to the bottom of each indices' column. I have also tried to fillna to another value, such as 0, but I have not been successful here, either.

While I am certainly using it wrong, the na_position parameter in sort_values does not produce the desired outcome, though it seems this is likely what want.

Edit:

The final df's index is not required to be in numerical order as in my second example.

By changing ignore_index to False in the single line of @Leb's third code block,

pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)

to

pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=False)

and by creating a temp df for all rows in a given index, I was able to make this work - not pretty, but it orders things how I need them. If someone (certainly) has a better way, please let me know.

new_df = df.ix[0]
new_df = pd.concat([new_df[col].sort_values().reset_index(drop=True) for col in new_df], axis=1, ignore_index=False)
max_index = df.index[-1]
for i in range(1, max_index + 1):
    tmp = df.ix[i]
    tmp = pd.concat([tmp[col].sort_values().reset_index(drop=True) for col in tmp], axis=1, ignore_index=False)
    new_df = pd.concat([new_df,tmp])


In [10]: new_df
Out[10]:
    ID    Name  val1  val2  val3
0    A   first     1     1     2
1  NaN     NaN     2     3     3
2  NaN     NaN   NaN   NaN   NaN
0    B  second     4     4     5
1  NaN     NaN     5     6     6
2  NaN     NaN   NaN   NaN   NaN
like image 619
AGS Avatar asked Nov 04 '15 19:11

AGS


People also ask

What does Fillna do in pandas?

Pandas DataFrame fillna() Method The fillna() method replaces the NULL values with a specified value. The fillna() method returns a new DataFrame object unless the inplace parameter is set to True , in that case the fillna() method does the replacing in the original DataFrame instead.

Does pandas STD ignore NaN?

std() function to calculate the standard deviation of values in a pandas DataFrame. Note that the std() function will automatically ignore any NaN values in the DataFrame when calculating the standard deviation.


1 Answers

I know the issue of pushing nans to an edge has been discussed on github. For your particular frame, I'd probably do it manually at the Python level, and not worry about performance much. Something like

>>> df.groupby(level=0, sort=False).transform(lambda x: sorted(x,key=pd.isnull))
    ID    Name  val1  val2  val3
0    A   first     1     1     2
0  NaN     NaN     2     3     3
0  NaN     NaN   NaN   NaN   NaN
1    B  second     4     4     5
1  NaN     NaN     5     6     6
1  NaN     NaN   NaN   NaN   NaN

should work. Note that since sorted is a stable sort, and we're using pd.isnull as the key (where False < True), we push the NaNs to the end while preserving the order of the remaining objects. Also note that here I'm grouping just on the index; we could alternatively have grouped on whatever we wanted.

like image 145
DSM Avatar answered Oct 17 '22 17:10

DSM