I am new to Python. I have a huge <code>DataFrame</code> with millions of rows and many IDs. My data looks like this:
<pre class="prettyprint"><code>Time   ID  X   Y
8:00   A   23  100
9:00   B   24  110
10:00  B   25  120
11:00  C   26  130
12:00  C   27  140
13:00  A   28  150
14:00  A   29  160
15:00  D   30  170
16:00  C   31  180
17:00  B   32  190
18:00  A   33  200
19:00  C   34  210
20:00  A   35  220
21:00  B   36  230
22:00  C   37  240
23:00  B   38  250
</code></pre>
I want to sort the data by ID and time. The result I expect looks like this:
<pre class="prettyprint"><code>Time   ID  X   Y
8:00   A   23  100
13:00  A   28  150
14:00  A   29  160
18:00  A   33  200
20:00  A   35  220
9:00   B   24  110
10:00  B   25  120
17:00  B   32  190
21:00  B   36  230
23:00  B   38  250
11:00  C   26  130
12:00  C   27  140
16:00  C   31  180
19:00  C   34  210
22:00  C   37  240
15:00  D   30  170
</code></pre>
Then I want to keep only the first and the last row of each ID and drop the rest. The expected result looks like this:
<pre class="prettyprint"><code>Time   ID  X   Y
8:00   A   23  100
20:00  A   35  220
9:00   B   24  110
23:00  B   38  250
11:00  C   26  130
22:00  C   37  240
15:00  D   30  170
</code></pre>
How can I do this in pandas? Thank you for your advice.

Use <code>groupby</code>, find the <code>head</code> and <code>tail</code> for each group, and <code>concat</code> the two.
<pre class="prettyprint"><code>g = df.groupby('ID')

(pd.concat([g.head(1), g.tail(1)])
   .drop_duplicates()
   .sort_values('ID')
   .reset_index(drop=True))

    Time ID   X    Y
0   8:00  A  23  100
1  20:00  A  35  220
2   9:00  B  24  110
3  23:00  B  38  250
4  11:00  C  26  130
5  22:00  C  37  240
6  15:00  D  30  170
</code></pre>
If you can guarantee each ID group has at least two rows, the <code>drop_duplicates</code> call is not needed.
<hr>
Details
<pre class="prettyprint"><code>g.head(1)

    Time ID   X    Y
0   8:00  A  23  100
1   9:00  B  24  110
3  11:00  C  26  130
7  15:00  D  30  170

g.tail(1)

     Time ID   X    Y
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250

pd.concat([g.head(1), g.tail(1)])

     Time ID   X    Y
0    8:00  A  23  100
1    9:00  B  24  110
3   11:00  C  26  130
7   15:00  D  30  170
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250
</code></pre>
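The snippet above relies on the rows already being in time order within each ID. Here is a minimal sketch that also does the sort the question asks for. It assumes `Time` is stored as `H:MM` strings, which need parsing first because a plain string sort would put `10:00` before `8:00`. The helper column `_t` and the variable names are made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Time": ["8:00", "9:00", "10:00", "11:00", "12:00", "13:00", "14:00", "15:00",
             "16:00", "17:00", "18:00", "19:00", "20:00", "21:00", "22:00", "23:00"],
    "ID": list("ABBCCAADCBACABCB"),
    "X": range(23, 39),
    "Y": range(100, 260, 10),
})

# Parse Time into a temporary column so the sort is chronological
# rather than alphabetical (assumption: Time holds H:MM strings).
df_sorted = (
    df.assign(_t=pd.to_datetime(df["Time"], format="%H:%M"))
      .sort_values(["ID", "_t"])
      .drop(columns="_t")
)

g = df_sorted.groupby("ID")
result = (
    pd.concat([g.head(1), g.tail(1)])
      .drop_duplicates()                    # single-row groups such as D appear twice
      .sort_values("ID", kind="mergesort")  # stable sort keeps first before last
      .reset_index(drop=True)
)
print(result)
```

A stable sort is requested explicitly so that, within each ID, the first row stays ahead of the last row after the final `sort_values`.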

If you create a small function that selects only the first and last rows of a DataFrame, you can apply it to each group of a group-by, like so:
<pre class="prettyprint lang-py prettyprint-override"><code>df.groupby('ID').apply(lambda x: x.iloc[[0, -1]])
</code></pre>
As others have mentioned, you may also want to call <code>.drop_duplicates()</code> or similar afterwards, to filter out the duplicated row for any 'ID' that has only one row.
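On a frame with millions of rows, a Python-level `apply` may be slow, and `drop_duplicates` will also remove distinct rows that happen to hold identical values. One possible alternative is a boolean mask built with `cumcount`, which selects each row at most once. This is only a sketch: it assumes the rows are already in the desired order within each ID, and the shortened sample frame and variable names are illustrative.

```python
import pandas as pd

# Shortened illustrative sample, not the full data from the question
df = pd.DataFrame({
    "Time": ["8:00", "9:00", "10:00", "13:00", "15:00", "17:00"],
    "ID": ["A", "B", "B", "A", "D", "B"],
    "X": [23, 24, 25, 28, 30, 32],
})

g = df.groupby("ID")

# cumcount numbers the rows inside each group 0, 1, 2, ... from the top,
# or from the bottom when ascending=False.
is_first = g.cumcount() == 0
is_last = g.cumcount(ascending=False) == 0

# A single-row group (D here) is both first and last, but is selected only once.
first_last = (
    df[is_first | is_last]
      .sort_values("ID", kind="mergesort")  # stable: keeps within-group order
      .reset_index(drop=True)
)
print(first_last)
```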

Get only the first and last rows of each group with pandas


How do I select the top rows in pandas?

Use head() to select the first N rows of a DataFrame. In pandas, the DataFrame class provides a head() method that returns the first n rows. If n is not provided, it defaults to 5.

How do you get the last 5 rows in pandas?

Use the tail() method: DataFrame.tail(n) returns the last n rows of the DataFrame. It takes one optional argument n, the number of rows to return from the end. If n is not passed, it defaults to 5, so tail() returns the last 5 rows.
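As a quick illustration of the two methods described above, here is a small sketch on a made-up ten-row frame (the column name `X` and the variable names are arbitrary):

```python
import pandas as pd

# Illustrative frame with ten rows, X = 0..9
df = pd.DataFrame({"X": range(10)})

top_default = df.head()      # first 5 rows, since n defaults to 5
top_three = df.head(3)       # first 3 rows
bottom_default = df.tail()   # last 5 rows, since n defaults to 5
bottom_two = df.tail(2)      # last 2 rows

print(top_three)
print(bottom_two)
```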



Arief


2 Answers

cs95

johnnybarrels
