I'm trying to sort the following Pandas DataFrame:
         RHS  age  height  shoe_size  weight
0     weight  NaN     0.0        0.0     1.0
1  shoe_size  NaN     0.0        1.0     NaN
2  shoe_size  3.0     0.0        0.0     NaN
3     weight  3.0     0.0        0.0     1.0
4        age  3.0     0.0        0.0     1.0
in such a way that the rows with more NaN values come first. More precisely, in the above df, the row with index 1 (2 NaNs) should come before the row with index 0 (1 NaN).
What I do now is:
df.sort_values(by=['age', 'height', 'shoe_size', 'weight'], na_position="first")
Sorting by index: you can sort a DataFrame by its index with sort_index(). To sort the index in descending order, pass ascending=False.
To sort by column values, use sort_values(). The ascending parameter controls the order: True (the default) for ascending, False for descending. By default, sort_values() returns a new sorted DataFrame and does not modify the original.
The main parameters of sort_values() are: by, a single column name or a list of column names to sort by; axis, 0 or 'index' to sort rows and 1 or 'columns' to sort columns; ascending, a Boolean that sorts in ascending order if True.
Using df.sort_values and loc-based accessing:
df = df.loc[df.isnull().sum(axis=1).sort_values(ascending=False).index]
print(df)
         RHS  age  height  shoe_size  weight
1  shoe_size  NaN     0.0        1.0     NaN
2  shoe_size  3.0     0.0        0.0     NaN
0     weight  NaN     0.0        0.0     1.0
4        age  3.0     0.0        0.0     1.0
3     weight  3.0     0.0        0.0     1.0
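df.isnull().sum(axis=1) counts the NaNs in each row, and the rows are then accessed in the order of this sorted count. Rows with equal counts may come out in any relative order, because the default sort algorithm is not stable.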
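As a self-contained sketch of the idea above, the following rebuilds the example frame from the question and reorders it by per-row NaN count. The variable names (nan_counts, order, result) and the choice of kind="mergesort" are illustrative assumptions, not part of the original answer; a stable sort is used here only so that rows with equal counts keep their original relative order.

```python
import numpy as np
import pandas as pd

# Example frame copied from the question.
df = pd.DataFrame({
    "RHS": ["weight", "shoe_size", "shoe_size", "weight", "age"],
    "age": [np.nan, np.nan, 3.0, 3.0, 3.0],
    "height": [0.0, 0.0, 0.0, 0.0, 0.0],
    "shoe_size": [0.0, 1.0, 0.0, 0.0, 0.0],
    "weight": [1.0, np.nan, np.nan, 1.0, 1.0],
})

# Number of NaNs in each row (axis=1 sums across the columns).
nan_counts = df.isnull().sum(axis=1)

# Negate the counts and sort ascending with a stable algorithm, so the
# rows with the most NaNs come first and ties keep their original order.
order = (-nan_counts).sort_values(kind="mergesort").index

# The sorted index holds labels, so select with .loc rather than .iloc.
result = df.loc[order]
print(result)
```

Negating the counts is just one way to get a descending order; passing ascending=False, as in the answer, expresses the same intent.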
@ayhan offered a nice little improvement to the solution above, involving pd.Series.argsort:
df = df.iloc[df.isnull().sum(axis=1).mul(-1).argsort()]
print(df)
         RHS  age  height  shoe_size  weight
1  shoe_size  NaN     0.0        1.0     NaN
0     weight  NaN     0.0        0.0     1.0
2  shoe_size  3.0     0.0        0.0     NaN
3     weight  3.0     0.0        0.0     1.0
4        age  3.0     0.0        0.0     1.0
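The argsort variant returns integer positions rather than index labels, which is why it pairs with iloc. Here is a minimal sketch of that difference, assuming a hypothetical string index (a to e) that is not in the original question; kind="mergesort" is also an assumption, added so that ties keep their original order.

```python
import numpy as np
import pandas as pd

# Same data as the question, but with a hypothetical non-numeric index.
df = pd.DataFrame(
    {
        "RHS": ["weight", "shoe_size", "shoe_size", "weight", "age"],
        "age": [np.nan, np.nan, 3.0, 3.0, 3.0],
        "height": [0.0, 0.0, 0.0, 0.0, 0.0],
        "shoe_size": [0.0, 1.0, 0.0, 0.0, 0.0],
        "weight": [1.0, np.nan, np.nan, 1.0, 1.0],
    },
    index=list("abcde"),
)

nan_counts = df.isnull().sum(axis=1)

# argsort yields positions (0..n-1); negating the counts puts the rows
# with the most NaNs first.
positions = (-nan_counts).argsort(kind="mergesort").to_numpy()
by_position = df.iloc[positions]

# Label-based equivalent: sort the counts and reuse the resulting labels.
labels = (-nan_counts).sort_values(kind="mergesort").index
by_label = df.loc[labels]

print(by_position.index.tolist())
```

Both selections give the same frame; the point is only that positions go with iloc and labels go with loc.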
df.isnull().sum().sort_values(ascending=False)
Here's a one-liner that will do it:
df.assign(Count_NA = lambda x: x.isnull().sum(axis=1)).sort_values('Count_NA', ascending=False).drop('Count_NA', axis=1)
#          RHS  age  height  shoe_size  weight
# 1  shoe_size  NaN     0.0        1.0     NaN
# 0     weight  NaN     0.0        0.0     1.0
# 2  shoe_size  3.0     0.0        0.0     NaN
# 3     weight  3.0     0.0        0.0     1.0
# 4        age  3.0     0.0        0.0     1.0
This works by assigning a temporary column ("Count_NA") to count the NAs in each row, sorting on that column, and then dropping it, all in the same expression.
You can add a column holding the number of null values in each row, sort by that column, then drop it. It's up to you whether to use .reset_index(drop=True) to renumber the rows afterwards.
df['null_count'] = df.isnull().sum(axis=1)
df.sort_values('null_count', ascending=False).drop('null_count', axis=1)
# returns
         RHS  age  height  shoe_size  weight
1  shoe_size  NaN     0.0        1.0     NaN
0     weight  NaN     0.0        0.0     1.0
2  shoe_size  3.0     0.0        0.0     NaN
3     weight  3.0     0.0        0.0     1.0
4        age  3.0     0.0        0.0     1.0
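Note that the assignment above leaves the null_count column on df itself. As a sketch of a variant that leaves the input untouched and optionally renumbers the rows, the helper below wraps the same steps. The function name sort_by_null_count and its reset parameter are hypothetical, introduced only for illustration, and kind="mergesort" is an added assumption to make the sort stable.

```python
import numpy as np
import pandas as pd

def sort_by_null_count(frame, reset=False):
    """Return a new frame with the rows holding the most NaNs first.

    Hypothetical helper: assumes frame has no column named "null_count".
    """
    out = (
        frame.assign(null_count=frame.isnull().sum(axis=1))  # temporary column
        .sort_values("null_count", ascending=False, kind="mergesort")
        .drop(columns="null_count")  # remove the temporary column again
    )
    # Optionally renumber the rows 0..n-1 instead of keeping the old labels.
    return out.reset_index(drop=True) if reset else out

# Example frame copied from the question.
df = pd.DataFrame({
    "RHS": ["weight", "shoe_size", "shoe_size", "weight", "age"],
    "age": [np.nan, np.nan, 3.0, 3.0, 3.0],
    "height": [0.0, 0.0, 0.0, 0.0, 0.0],
    "shoe_size": [0.0, 1.0, 0.0, 0.0, 0.0],
    "weight": [1.0, np.nan, np.nan, 1.0, 1.0],
})

sorted_df = sort_by_null_count(df, reset=True)
print(sorted_df)
```

Because assign returns a new frame, df keeps its original columns and row order after the call.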