I am new to Python. I have a huge DataFrame with millions of rows and many IDs. My data looks like this:
    Time   ID  X   Y
    8:00   A   23  100
    9:00   B   24  110
    10:00  B   25  120
    11:00  C   26  130
    12:00  C   27  140
    13:00  A   28  150
    14:00  A   29  160
    15:00  D   30  170
    16:00  C   31  180
    17:00  B   32  190
    18:00  A   33  200
    19:00  C   34  210
    20:00  A   35  220
    21:00  B   36  230
    22:00  C   37  240
    23:00  B   38  250
I want to sort the data by ID and time. The expected result looks like this:
    Time   ID  X   Y
    8:00   A   23  100
    13:00  A   28  150
    14:00  A   29  160
    18:00  A   33  200
    20:00  A   35  220
    9:00   B   24  110
    10:00  B   25  120
    17:00  B   32  190
    21:00  B   36  230
    23:00  B   38  250
    11:00  C   26  130
    12:00  C   27  140
    16:00  C   31  180
    19:00  C   34  210
    22:00  C   37  240
    15:00  D   30  170
Then I want to keep only the first and the last row of each ID and drop the rest. The expected result looks like this:
    Time   ID  X   Y
    8:00   A   23  100
    20:00  A   35  220
    9:00   B   24  110
    23:00  B   38  250
    11:00  C   26  130
    22:00  C   37  240
    15:00  D   30  170
How can I do this in pandas? Thank you for your advice.
Select the first N rows of a DataFrame using head()
In pandas, the DataFrame class provides a head() function that returns the first n rows of a DataFrame. If n is not provided, the default value is 5.
Method 1: Using the tail() method
DataFrame.tail(n) returns the last n rows of the DataFrame. It takes one optional argument n (the number of rows to return from the end). By default n = 5, so it returns the last 5 rows if n is not passed.
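As a quick illustration of the two calls described above, here is a minimal sketch on a few rows borrowed from the question's data. The variable names (first_row, last_row, default_head) are made up for this example.

```python
import pandas as pd

# A small sample taken from the question's data.
df = pd.DataFrame({
    "Time": ["8:00", "9:00", "10:00", "11:00"],
    "ID": ["A", "B", "B", "C"],
    "X": [23, 24, 25, 26],
    "Y": [100, 110, 120, 130],
})

first_row = df.head(1)     # only the first row
last_row = df.tail(1)      # only the last row
default_head = df.head()   # n defaults to 5; this frame has just 4 rows, so all come back
```

Both methods return a new DataFrame and keep the original index labels, which is why the grouped versions further down can be concatenated and de-duplicated afterwards.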
Use groupby, find the head and tail for each group, and concat the two.
    g = df.groupby('ID')

    (pd.concat([g.head(1), g.tail(1)])
       .drop_duplicates()
       .sort_values('ID')
       .reset_index(drop=True))

        Time ID   X    Y
    0   8:00  A  23  100
    1  20:00  A  35  220
    2   9:00  B  24  110
    3  23:00  B  38  250
    4  11:00  C  26  130
    5  22:00  C  37  240
    6  15:00  D  30  170
If you can guarantee each ID group has at least two rows, the drop_duplicates call is not needed.
Details
    g.head(1)

        Time ID   X    Y
    0   8:00  A  23  100
    1   9:00  B  24  110
    3  11:00  C  26  130
    7  15:00  D  30  170

    g.tail(1)

         Time ID   X    Y
    7   15:00  D  30  170
    12  20:00  A  35  220
    14  22:00  C  37  240
    15  23:00  B  38  250

    pd.concat([g.head(1), g.tail(1)])

         Time ID   X    Y
    0    8:00  A  23  100
    1    9:00  B  24  110
    3   11:00  C  26  130
    7   15:00  D  30  170
    7   15:00  D  30  170
    12  20:00  A  35  220
    14  22:00  C  37  240
    15  23:00  B  38  250
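The head/tail approach relies on the rows already being in time order within each ID. If that is not guaranteed, one option is to sort first. The sketch below assumes Time is stored as "H:MM" strings, which do not sort chronologically as text ("13:00" comes before "8:00"), so it sorts on a parsed helper column. The helper column name _t is invented for this example, and de-duplicating on the index instead of calling drop_duplicates() is a choice made here so that two different rows with identical values are not merged.

```python
import pandas as pd

# The question's data, rebuilt by hand.
df = pd.DataFrame({
    "Time": ["8:00", "9:00", "10:00", "11:00", "12:00", "13:00", "14:00", "15:00",
             "16:00", "17:00", "18:00", "19:00", "20:00", "21:00", "22:00", "23:00"],
    "ID": list("ABBCCAADCBACABCB"),
    "X": range(23, 39),
    "Y": range(100, 260, 10),
})

# Sort by ID, then by the parsed time (not the raw string).
df_sorted = (
    df.assign(_t=pd.to_datetime(df["Time"], format="%H:%M"))
      .sort_values(["ID", "_t"])
      .drop(columns="_t")
      .reset_index(drop=True)
)

# First and last row of every ID group.
g = df_sorted.groupby("ID")
ends = pd.concat([g.head(1), g.tail(1)])

# A single-row group (ID "D") appears in both head and tail: keep it once,
# then restore the sorted order and renumber.
result = ends[~ends.index.duplicated()].sort_index().reset_index(drop=True)
print(result)
```

Because the index is reset right after sorting, sort_index() at the end puts the selected rows back in ID-then-time order.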
If you create a small function to only select the first and last rows of a DataFrame, you can apply this to a group-by, like so:
    df.groupby('ID').apply(lambda x: x.iloc[[0, -1]])
As others have mentioned, it might be nice to also call .drop_duplicates() or similar afterwards, to filter out the duplicated rows for any 'ID' that has only one row.
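If the duplicate rows for single-row IDs are a concern, a different technique that avoids them altogether is a boolean mask built with DataFrame.duplicated. This is not what the answers above use; it is an alternative sketch, assuming the frame is already in the desired time order within each ID. The names is_first, is_last and ends are made up for this example.

```python
import pandas as pd

# A reduced sample: A and B have several rows, D has only one.
df = pd.DataFrame({
    "Time": ["8:00", "9:00", "10:00", "13:00", "14:00", "15:00", "17:00"],
    "ID": ["A", "B", "B", "A", "A", "D", "B"],
})

# duplicated(keep="first") flags every row except the first per ID;
# duplicated(keep="last") flags every row except the last per ID.
is_first = ~df.duplicated("ID", keep="first")
is_last = ~df.duplicated("ID", keep="last")

# A row is kept if it is the first or the last of its ID, so a single-row
# ID is selected exactly once and no de-duplication step is needed.
ends = (
    df[is_first | is_last]
    .sort_values("ID", kind="stable")   # stable sort keeps time order inside each ID
    .reset_index(drop=True)
)
print(ends)
```

Since this only computes two boolean columns and one filter, it may also scale more comfortably to millions of rows than a per-group apply, though that is worth measuring on the real data.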