I have a data frame that I want to remove duplicates that are consecutive (in base). I know rle
may be helpful here but can't think of how to use it. The example output will help to illuminate what I'm asking for.
Generate sample data:
set.seed(12)
samps <- sample(1:5, 20, T)
dat <- data.frame(v1=LETTERS[samps], v2=month.abb[samps])
dat[10, 2] <- "Mar"
Sample data:
v1 v2
1 A Jan
2 E May
3 E May
4 B Feb
5 A Jan
6 A Jan
7 A Jan
8 D Apr
9 A Jan
10 A Mar
11 B Feb
12 E May
13 B Feb
14 B Feb
15 B Feb
16 C Mar
17 C Mar
18 C Mar
19 D Apr
20 A Jan
Desired outcome:
v1 v2
1 A Jan
3 E May
4 B Feb
7 A Jan
8 D Apr
10 A Mar
11 B Feb
12 E May
15 B Feb
18 C Mar
19 D Apr
20 A Jan
To drop consecutive duplicates with Python Pandas, we can use shift . to check if the last column isn't equal the current one with a. shift(-1) !=
You can set 'keep=False' in the drop_duplicates() function to remove all the duplicate rows. For E.x, df. drop_duplicates(keep=False) .
DataFrame. duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean series which identifies whether a row is duplicate or unique.
Here's a way, not with rle
, but a way none-the-less:
dat[with(dat, c(TRUE, diff(as.numeric(interaction(v1, v2))) != 0)), ]
This assumes you're using factor
columns, as your sample data implies.
Here a fast solution using filter
dat[(filter(dat,c(-1,1))!= 0)[,1],]
v1 v2
1 A Jan
3 E May
4 B Feb
7 A Jan
8 D Apr
10 A Mar
11 B Feb
12 E May
15 B Feb
18 C Mar
19 D Apr
NA <NA> <NA>
You need to add the last value of the original data to the result.
Using rle
I came up with this
ind <- cumsum(rle(as.character(dat$v1))$length)
dat[ind, ]
ind
indicates either the first or the last of consecutive entries.
EDIT:
A simple solution to Matthews comment would be
dat[15, 2] <- "May"
dat[cumsum(rle(paste0(dat$v1, dat$v2))$length), ]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With