pandas group by and find first non null value for all columns

I have a pandas DataFrame like the one below:

id  age   gender  country  sales_year
1   None   M       India    2016
2   23     F       India    2016
1   20     M       India    2015
2   25     F       India    2015
3   30     M       India    2019
4   36     None    India    2019

I want to group by id and, for each id, get a single row holding the latest non-null value of every column, where "latest" is determined by sales_year.

Expected output:

id  age   gender  country  sales_year
1   20     M       India    2016
2   23     F       India    2016
3   30     M       India    2019
4   36     None    India    2019

In PySpark, I would do this for the age column:

df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))

But I need the same solution in pandas.

EDIT: This can be the case with any of the columns, not just age. I need it to pick up the latest non-null data (if it exists) for every id.

asked Nov 26 '19 10:11
j '


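1 Answer

Use GroupBy.first, which returns the first non-null value of each column within each group. This assumes the missing entries are real nulls (None/NaN) rather than the string 'None', and that the rows are already ordered from latest to earliest sales_year within each id: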

df1 = df.groupby('id', as_index=False).first()
print(df1)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019
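
To reproduce the result above, here is a minimal self-contained sketch. The way the frame is constructed is an assumption (the question only shows the printed table), with the missing entries entered as real nulls:

```python
import numpy as np
import pandas as pd

# Sample data from the question. Missing entries are assumed to be real
# nulls (np.nan / None), not the literal string "None".
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": [np.nan, 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", None],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# first() skips nulls column by column, so for id 1 the age comes from the
# 2015 row while sales_year still comes from the 2016 row.
df1 = df.groupby("id", as_index=False).first()
print(df1)
```

Because each column is resolved independently, a returned row can mix values from different source rows of the same id, which is what the expected output in the question asks for.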

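If the rows are not already sorted by sales_year in descending order, sort them first so that the latest values are seen first in each group: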

df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print(df2)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019
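
The PySpark expression in the question keeps every row and only overwrites age with the latest non-null value per id. A closer pandas analogue of that window-style behaviour might use GroupBy.transform instead of collapsing to one row per id. This is a sketch under the same assumption about how the sample frame is built:

```python
import numpy as np
import pandas as pd

# Sample data from the question, with missing entries as real nulls (assumed).
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": [np.nan, 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", None],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# Sort newest first, then broadcast each group's first non-null age back to
# every row of that group. The result keeps the original index labels.
latest_age = (
    df.sort_values("sales_year", ascending=False)
      .groupby("id")["age"]
      .transform("first")
)

# Assignment aligns on the index, so the original row order is preserved.
df["age"] = latest_age
print(df)
```

Unlike the GroupBy.first call above, this leaves the number of rows unchanged; the same pattern can be repeated for other columns if needed.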

answered Oct 01 '22 13:10
jezrael