I have the following dataframe:
a a a b c c d e a a b b b e e d d
The required result should be
a b c d e a b e d
It means no two consecutive rows should have same value. How it can be done without using loop.
As my data set is quite huge, looping is taking lot of time to execute.
The dataframe structure is like the following
a 1 a 2 a 3 b 2 c 4 c 1 d 3 e 9 a 4 a 8 b 10 b 199 e 2 e 5 d 4 d 10
Result:
a 1 b 2 c 4 d 3 e 9 a 4 b 10 e 2 d 4
Its should delete the entire row.
To drop consecutive duplicates with Python Pandas, we can use shift . to check if the last column isn't equal the current one with a. shift(-1) !=
If you use 'uniq' command without any arguments, it will remove all consecutive duplicate lines and display only the unique lines.
One easy way is to use rle
:
Here's your sample data:
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d") # Read 17 items
rle
returns a list
with two values: the run length ("lengths
"), and the value that is repeated for that run ("values
").
rle(x)$values # [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
data.frame
If you are working with a data.frame
, try something like the following:
## Sample data mydf <- data.frame( V1 = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "e", "e", "d", "d"), V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199, 2, 5, 4, 10) ) ## Use rle, as before X <- rle(mydf$V1) ## Identify the rows you want to keep Y <- cumsum(c(1, X$lengths[-length(X$lengths)])) Y # [1] 1 4 5 7 8 9 11 13 15 mydf[Y, ] # V1 V2 # 1 a 1 # 4 b 2 # 5 c 4 # 7 d 3 # 8 e 9 # 9 a 4 # 11 b 10 # 13 e 2 # 15 d 4
The "data.table" package has a function rleid
that lets you do this quite easily. Using mydf
from above, try:
library(data.table) as.data.table(mydf)[, .SD[1], by = rleid(V1)] # rleid V2 # 1: 1 1 # 2: 2 2 # 3: 3 4 # 4: 4 3 # 5: 5 9 # 6: 6 4 # 7: 7 10 # 8: 8 2 # 9: 9 4
library(dplyr) x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d") x[x!=lag(x, default=1)] #[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
EDIT: For data.frame
mydf <- data.frame( V1 = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "e", "e", "d", "d"), V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199, 2, 5, 4, 10), stringsAsFactors=FALSE)
dplyr solution is one liner:
mydf %>% filter(V1!= lag(V1, default="1")) # V1 V2 #1 a 1 #2 b 2 #3 c 4 #4 d 3 #5 e 9 #6 a 4 #7 b 10 #8 e 2 #9 d 4
post scriptum
lead(x,1)
suggested by @Carl Witthoft iterates in reverse order.
leadit<-function(x) x!=lead(x, default="what") rows <- leadit(mydf[ ,1]) mydf[rows, ] # V1 V2 #3 a 3 #4 b 2 #6 c 1 #7 d 3 #8 e 9 #10 a 8 #12 b 199 #14 e 5 #16 d 10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With