pandas - get most recent value of a particular column indexed by another column (get maximum value of a particular column indexed by another column)

I have the following dataframe:

   obj_id   data_date   value
0  4        2011-11-01  59500
1  2        2011-10-01  35200
2  4        2010-07-31  24860
3  1        2009-07-28  15860
4  2        2008-10-15  200200

I want to get a subset of this data so that I only have the most recent (largest 'data_date') 'value' for each 'obj_id'.

I've hacked together a solution, but it feels dirty. I was wondering if anyone has a better way. I'm sure I must be missing some easy way to do it through pandas.

My method is essentially to group, sort, retrieve, and recombine as follows:

from pandas import DataFrame

row_arr = []
for grp, grp_df in df.groupby('obj_id'):
    # take the row with the latest data_date in this group
    row_arr.append(grp_df.sort_values('data_date', ascending=False)[:1].values[0])

df_new = DataFrame(row_arr, columns=('obj_id', 'data_date', 'value'))
enrishi asked Mar 24 '12 10:03


2 Answers

If the number of distinct 'obj_id' values is very high, you'll want to sort the entire dataframe by 'data_date' and then drop duplicates, keeping the last row for each 'obj_id'.

sorted_df = df.sort_values('data_date')
result = sorted_df.drop_duplicates('obj_id', keep='last').values

This should be faster (sorry, I didn't test it) because you don't have to run a custom agg function, which is slow when there is a large number of keys. You might think it's worse to sort the entire dataframe, but in practice sorts are fast in Python and native loops are slow.
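For reference, here is a minimal sketch of the sort-then-drop-duplicates approach on the sample data from the question. It assumes a pandas version that has `sort_values` and that 'data_date' has been parsed to datetimes. The final `sort_values('obj_id')` and `reset_index` calls are only there to make the output tidy and are not part of the technique itself.

```python
import pandas as pd

# Sample data from the question, with data_date parsed as datetimes
df = pd.DataFrame({
    "obj_id": [4, 2, 4, 1, 2],
    "data_date": pd.to_datetime(
        ["2011-11-01", "2011-10-01", "2010-07-31", "2009-07-28", "2008-10-15"]
    ),
    "value": [59500, 35200, 24860, 15860, 200200],
})

latest = (
    df.sort_values("data_date")                # oldest first, newest last
      .drop_duplicates("obj_id", keep="last")  # keep the newest row per obj_id
      .sort_values("obj_id")                   # optional: tidy ordering
      .reset_index(drop=True)
)

print(latest)
```

Dropping the trailing `.values` keeps the result as a DataFrame with its column labels, which is usually easier to work with than a bare NumPy array.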

thetainted1 answered Oct 08 '22 15:10


This is another possible solution. I don't know if it is the fastest (I doubt it), since I have not benchmarked it against other approaches.

df.loc[df.groupby('obj_id').data_date.idxmax(),:] 
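A small self-contained sketch of the same idea on the question's sample data, in case it helps. It assumes 'data_date' is a datetime column and that the dataframe's index labels are unique, since `idxmax` returns index labels that are then passed to `.loc`.

```python
import pandas as pd

# Sample data from the question, with data_date parsed as datetimes
df = pd.DataFrame({
    "obj_id": [4, 2, 4, 1, 2],
    "data_date": pd.to_datetime(
        ["2011-11-01", "2011-10-01", "2010-07-31", "2009-07-28", "2008-10-15"]
    ),
    "value": [59500, 35200, 24860, 15860, 200200],
})

# For each obj_id, find the index label of the row with the latest data_date
idx = df.groupby("obj_id")["data_date"].idxmax()

# Select those rows; the original index labels are preserved
latest = df.loc[idx]

print(latest)
```

If two rows in a group share the same latest date, `idxmax` returns the first of them, so the result still has exactly one row per 'obj_id'.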
pdifranc answered Oct 08 '22 14:10
