Get only the first and last rows of each group with pandas

I am a newbie in Python. I have a huge dataframe with millions of rows and IDs. My data looks like this:

Time    ID  X   Y
8:00    A   23  100
9:00    B   24  110
10:00   B   25  120
11:00   C   26  130
12:00   C   27  140
13:00   A   28  150
14:00   A   29  160
15:00   D   30  170
16:00   C   31  180
17:00   B   32  190
18:00   A   33  200
19:00   C   34  210
20:00   A   35  220
21:00   B   36  230
22:00   C   37  240
23:00   B   38  250

I want to sort the data by ID and time. The result I am looking for looks like this:

Time    ID  X   Y
8:00    A   23  100
13:00   A   28  150
14:00   A   29  160
18:00   A   33  200
20:00   A   35  220
9:00    B   24  110
10:00   B   25  120
17:00   B   32  190
21:00   B   36  230
23:00   B   38  250
11:00   C   26  130
12:00   C   27  140
16:00   C   31  180
19:00   C   34  210
22:00   C   37  240
15:00   D   30  170

Then I want to keep only the first and the last row of each ID and eliminate the rest. The expected result looks like this:

Time    ID  X   Y
8:00    A   23  100
20:00   A   35  220
9:00    B   24  110
23:00   B   38  250
11:00   C   26  130
22:00   C   37  240
15:00   D   30  170

How can I do this in pandas? Thank you for your advice.

Arief asked Dec 26 '18 04:12



2 Answers

Use groupby, find the head and tail for each group, and concat the two.

g = df.groupby('ID')

(pd.concat([g.head(1), g.tail(1)])
   .drop_duplicates()
   .sort_values('ID')
   .reset_index(drop=True))

    Time ID   X    Y
0   8:00  A  23  100
1  20:00  A  35  220
2   9:00  B  24  110
3  23:00  B  38  250
4  11:00  C  26  130
5  22:00  C  37  240
6  15:00  D  30  170

If you can guarantee each ID group has at least two rows, the drop_duplicates call is not needed.


Details

g.head(1)

    Time ID   X    Y
0   8:00  A  23  100
1   9:00  B  24  110
3  11:00  C  26  130
7  15:00  D  30  170

g.tail(1)

     Time ID   X    Y
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250

pd.concat([g.head(1), g.tail(1)])

     Time ID   X    Y
0    8:00  A  23  100
1    9:00  B  24  110
3   11:00  C  26  130
7   15:00  D  30  170
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250
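
For anyone who wants to run the approach above end to end, here is a minimal, self-contained sketch. It rebuilds the question's sample data by hand, and it assumes the rows already arrive in time order, as they do in the question. The kind="mergesort" argument is an addition of mine, not part of the snippet above: a stable sort keeps each group's first row ahead of its last row after sorting on ID.

```python
import pandas as pd

# Sample data rebuilt from the question (rows are already in time order).
df = pd.DataFrame({
    "Time": [f"{hour}:00" for hour in range(8, 24)],
    "ID": list("ABBCCAADCBACABCB"),
    "X": list(range(23, 39)),
    "Y": list(range(100, 260, 10)),
})

g = df.groupby("ID")

result = (
    pd.concat([g.head(1), g.tail(1)])      # first row and last row of every group
      .drop_duplicates()                   # a one-row group would otherwise appear twice
      .sort_values("ID", kind="mergesort") # stable sort (my assumption, see note above)
      .reset_index(drop=True)
)

print(result)
```

Note that drop_duplicates compares row values, not index labels, so a group whose first and last rows happen to be identical in every column would also collapse to a single row.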
cs95 answered Sep 28 '22 06:09


If you create a small function to only select the first and last rows of a DataFrame, you can apply this to a group-by, like so:

df.groupby('ID').apply(lambda x: x.iloc[[0, -1]])

As others have mentioned, it might be nice to also .drop_duplicates() or similar after the fact, to filter out duplicated rows for cases where there was only one row for the 'ID'.
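
Here is a rough sketch of the same idea that handles one-row groups directly instead of de-duplicating afterwards. The helper name first_last_rows and the small sample frame are made up for illustration. It loops over the groups rather than calling apply, because, as far as I know, recent pandas releases have changed how apply treats the grouping column, and a plain loop sidesteps that.

```python
import pandas as pd

def first_last_rows(frame, key):
    """Return the first and last row of each group (hypothetical helper)."""
    pieces = []
    for _, group in frame.groupby(key):
        # iloc[[0, -1]] would repeat the only row of a one-row group,
        # so such a group is kept unchanged.
        pieces.append(group if len(group) == 1 else group.iloc[[0, -1]])
    return pd.concat(pieces).reset_index(drop=True)

# Made-up sample: a subset of the question's rows, including a one-row group "D".
sample = pd.DataFrame({
    "Time": ["8:00", "9:00", "10:00", "13:00", "15:00", "20:00"],
    "ID": ["A", "B", "B", "A", "D", "A"],
    "X": [23, 24, 25, 28, 30, 35],
})

picked = first_last_rows(sample, "ID")
print(picked)
```

Rows keep their original order within each group, so this relies on the data already being sorted by time, just like the one-liner above.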

johnnybarrels answered Sep 28 '22 06:09