
Sort rows of a dataframe in descending order of NaN counts

I'm trying to sort the following Pandas DataFrame:

         RHS  age  height  shoe_size  weight
0     weight  NaN     0.0        0.0     1.0
1  shoe_size  NaN     0.0        1.0     NaN
2  shoe_size  3.0     0.0        0.0     NaN
3     weight  3.0     0.0        0.0     1.0
4        age  3.0     0.0        0.0     1.0

in such a way that the rows with more NaN values come first. More precisely, in the df above, the row with index 1 (2 NaNs) should come before the row with index 0 (1 NaN).

What I do now is:

df.sort_values(by=['age', 'height', 'shoe_size', 'weight'], na_position="first")
Juan Carlos asked Aug 27 '17 22:08



4 Answers

Using df.sort_values and label-based access with df.loc:

df = df.loc[df.isnull().sum(axis=1).sort_values(ascending=False).index]
print(df)

         RHS  age  height  shoe_size  weight
1  shoe_size  NaN     0.0        1.0     NaN
2  shoe_size  3.0     0.0        0.0     NaN
0     weight  NaN     0.0        0.0     1.0
4        age  3.0     0.0        0.0     1.0
3     weight  3.0     0.0        0.0     1.0

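df.isnull().sum(axis=1) counts the NaNs in each row, and the rows are then selected in the order of this sorted count. Rows with the same count may come out in any relative order, because the default sort algorithm is not stable.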
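To make the two steps easier to follow, here is a minimal sketch that rebuilds the frame from the question by hand and separates the counting from the row selection. The names `counts` and `order` are illustrative, and `kind="mergesort"` is an added assumption (not part of the answer above) so that rows with equal counts keep their original relative order.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame({
    "RHS": ["weight", "shoe_size", "shoe_size", "weight", "age"],
    "age": [np.nan, np.nan, 3.0, 3.0, 3.0],
    "height": [0.0, 0.0, 0.0, 0.0, 0.0],
    "shoe_size": [0.0, 1.0, 0.0, 0.0, 0.0],
    "weight": [1.0, np.nan, np.nan, 1.0, 1.0],
})

# Step 1: number of NaNs in each row (a Series indexed like df).
counts = df.isna().sum(axis=1)

# Step 2: index labels ordered by descending count.
# mergesort is a stable algorithm, so ties are not reshuffled.
order = counts.sort_values(ascending=False, kind="mergesort").index

# Step 3: select rows by label, in that order.
result = df.loc[order]
print(result)
```

Because `order` holds index labels, `df.loc` is the matching accessor; `df.iloc` would only give the same rows when the index happens to be the default 0..n-1 range.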


@ayhan offered a nice little improvement to the solution above, involving pd.Series.argsort:

df = df.iloc[df.isnull().sum(axis=1).mul(-1).argsort()]
print(df)

         RHS  age  height  shoe_size  weight
1  shoe_size  NaN     0.0        1.0     NaN
0     weight  NaN     0.0        0.0     1.0
2  shoe_size  3.0     0.0        0.0     NaN
3     weight  3.0     0.0        0.0     1.0
4        age  3.0     0.0        0.0     1.0
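The argsort variant works on positions rather than labels, which is why it pairs with df.iloc. A small sketch of that point, assuming a made-up string index ("a" to "e") that is not in the original question, and using NumPy's argsort with kind="stable" so ties stay in their original order:

```python
import numpy as np
import pandas as pd

# Same data as in the question, but with a hypothetical non-default index
# to show that the selection is positional.
df = pd.DataFrame(
    {
        "RHS": ["weight", "shoe_size", "shoe_size", "weight", "age"],
        "age": [np.nan, np.nan, 3.0, 3.0, 3.0],
        "height": [0.0, 0.0, 0.0, 0.0, 0.0],
        "shoe_size": [0.0, 1.0, 0.0, 0.0, 0.0],
        "weight": [1.0, np.nan, np.nan, 1.0, 1.0],
    },
    index=list("abcde"),
)

# NaN count per row as a plain NumPy array.
counts = df.isna().sum(axis=1).to_numpy()

# Negate so that the largest count sorts first; argsort returns integer positions.
positions = np.argsort(-counts, kind="stable")

# Positions go with iloc, regardless of what the index labels are.
result = df.iloc[positions]
print(result)
```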
cs95 answered Oct 11 '22 14:10


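df.isnull().sum().sort_values(ascending=False)

Note that this returns the number of NaNs per column, sorted in descending order. It does not reorder the rows of df; for per-row counts, use df.isnull().sum(axis=1).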
Zainab Ali answered Oct 11 '22 14:10


Here's a one-liner that will do it:

df.assign(Count_NA = lambda x: x.isnull().sum(axis=1)).sort_values('Count_NA', ascending=False).drop('Count_NA', axis=1)
#          RHS  age  height  shoe_size  weight
# 1  shoe_size  NaN     0.0        1.0     NaN
# 0     weight  NaN     0.0        0.0     1.0
# 2  shoe_size  3.0     0.0        0.0     NaN
# 3     weight  3.0     0.0        0.0     1.0
# 4        age  3.0     0.0        0.0     1.0

This works by assigning a temporary column ("Count_NA") to count the NAs in each row, sorting on that column, and then dropping it, all in the same expression.
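To see what the chain does at each stage, the sketch below splits it into two parts and keeps the intermediate frame. The variable names `with_counts` and `result` are illustrative; the data is the frame from the question, rebuilt by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "RHS": ["weight", "shoe_size", "shoe_size", "weight", "age"],
    "age": [np.nan, np.nan, 3.0, 3.0, 3.0],
    "height": [0.0, 0.0, 0.0, 0.0, 0.0],
    "shoe_size": [0.0, 1.0, 0.0, 0.0, 0.0],
    "weight": [1.0, np.nan, np.nan, 1.0, 1.0],
})

# assign() returns a new frame with the extra column; df itself is not modified.
with_counts = df.assign(Count_NA=lambda x: x.isnull().sum(axis=1))
print(with_counts["Count_NA"].tolist())

# Sort on the temporary column, then drop it again.
result = with_counts.sort_values("Count_NA", ascending=False).drop("Count_NA", axis=1)
print(result)
```

Since assign() works on a copy, the original df never carries the helper column, which is the main practical difference from adding the column with df['...'] = ....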

cmaher answered Oct 11 '22 15:10


You can add a column holding the number of null values in each row, sort by that column, then drop it. Whether you also call .reset_index(drop=True) to renumber the index after sorting is up to you.

df['null_count'] = df.isnull().sum(axis=1)
df.sort_values('null_count', ascending=False).drop('null_count', axis=1)

# returns
         RHS  age  height  shoe_size  weight
1  shoe_size  NaN     0.0        1.0     NaN
0     weight  NaN     0.0        0.0     1.0
2  shoe_size  3.0     0.0        0.0     NaN
3     weight  3.0     0.0        0.0     1.0
4        age  3.0     0.0        0.0     1.0
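A short sketch of the optional reset_index step, under the assumption that you want to leave the original frame untouched (hence the copy named `work`, which is not part of the answer above). It shows the index before and after renumbering.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "RHS": ["weight", "shoe_size", "shoe_size", "weight", "age"],
    "age": [np.nan, np.nan, 3.0, 3.0, 3.0],
    "height": [0.0, 0.0, 0.0, 0.0, 0.0],
    "shoe_size": [0.0, 1.0, 0.0, 0.0, 0.0],
    "weight": [1.0, np.nan, np.nan, 1.0, 1.0],
})

# Work on a copy so the helper column never ends up on df.
work = df.copy()
work["null_count"] = work.isnull().sum(axis=1)

# Sorted rows keep their original index labels.
sorted_df = work.sort_values("null_count", ascending=False).drop("null_count", axis=1)
print(sorted_df.index.tolist())

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = sorted_df.reset_index(drop=True)
print(renumbered.index.tolist())
```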
James answered Oct 11 '22 13:10