I have the following dataframe:
   obj_id   data_date   value
0       4  2011-11-01   59500
1       2  2011-10-01   35200
2       4  2010-07-31   24860
3       1  2009-07-28   15860
4       2  2008-10-15  200200
I want to get a subset of this data so that I only have the most recent (largest 'data_date') 'value' for each 'obj_id'.
I've hacked together a solution, but it feels dirty. I was wondering if anyone has a better way. I'm sure I must be missing some easy way to do it through pandas.
My method is essentially to group, sort, retrieve, and recombine as follows:
row_arr = []
for grp, grp_df in df.groupby('obj_id'):
    row_arr.append(grp_df.sort_values('data_date', ascending=False)[:1].values[0])
df_new = DataFrame(row_arr, columns=('obj_id', 'data_date', 'value'))
If the number of "obj_id"s is very high, you'll want to sort the entire dataframe and then drop duplicates to get the last element per group.
sorted_df = df.sort_values('data_date')
result = sorted_df.drop_duplicates('obj_id', keep='last').values
This should be faster (sorry I didn't test it) because you don't have to do a custom agg function, which is slow when there is a large number of keys. You might think it's worse to sort the entire dataframe, but in practice in python sorts are fast and native loops are slow.
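For example, on the sample data from the question, the sort-then-drop approach keeps only the newest row per 'obj_id'. This is a sketch using `sort_values` (the modern spelling of the old `sort_index(by=...)`):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'data_date': pd.to_datetime(['2011-11-01', '2011-10-01',
                                 '2010-07-31', '2009-07-28', '2008-10-15']),
    'value': [59500, 35200, 24860, 15860, 200200],
})

# Sort so the most recent date comes last within each obj_id,
# then keep only the last (i.e. newest) row for each obj_id.
result = (df.sort_values('data_date')
            .drop_duplicates('obj_id', keep='last'))
print(result)
```

The result keeps one row per 'obj_id': the 2011-11-01 row for 4, the 2011-10-01 row for 2, and the single 2009-07-28 row for 1.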
This is another possible solution. I don't know if it is the fastest (I doubt it), since I have not benchmarked it against the other approaches.
df.loc[df.groupby('obj_id').data_date.idxmax(),:]
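On the same sample data, `groupby('obj_id').data_date.idxmax()` returns, for each group, the index label of the row with the largest 'data_date', and `.loc` then selects exactly those rows. A sketch, assuming the index labels are unique:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'data_date': pd.to_datetime(['2011-11-01', '2011-10-01',
                                 '2010-07-31', '2009-07-28', '2008-10-15']),
    'value': [59500, 35200, 24860, 15860, 200200],
})

# idxmax gives the index label of the newest row within each obj_id;
# .loc then pulls those full rows out of the original frame.
newest_idx = df.groupby('obj_id').data_date.idxmax()
result = df.loc[newest_idx]
print(result)
```

Note that `idxmax` raises an error if a group's 'data_date' is all-NaN, so this assumes every group has at least one valid date.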