How to remove partial duplicates from a data frame?

Tags:

r

The data I'm importing describes numeric measurements taken at various locations at more or less evenly spaced timestamps. Sometimes the spacing is not really even, and I have to discard some of the values. It does not matter much which ones, as long as I end up with one value per timestamp per location.

What do I do with the data? I add it to a result data.frame. It has a timestamp column whose values are definitely evenly spaced according to the step.

timestamps <- ceiling(as.numeric((timestamps-epoch)*24*60/step))*step*60 + epoch
result[result$timestamp %in% timestamps, columnName] <- values

This does NOT work when I have timestamps that fall in the same time step. This is an example:

> data.frame(ts=timestamps, v=values)
                   ts         v
1 2009-09-30 10:00:00 -2.081609
2 2009-09-30 10:04:18 -2.079778
3 2009-09-30 10:07:47 -2.113531
4 2009-09-30 10:09:01 -2.124716
5 2009-09-30 10:15:00 -2.102117
6 2009-09-30 10:27:56 -2.093542
7 2009-09-30 10:30:00 -2.092626
8 2009-09-30 10:45:00 -2.086339
9 2009-09-30 11:00:00 -2.080144
> data.frame(ts=ceiling(as.numeric((timestamps-epoch)*24*60/step))*step*60+epoch,
+ v=values)
                   ts         v
1 2009-09-30 10:00:00 -2.081609
2 2009-09-30 10:15:00 -2.079778
3 2009-09-30 10:15:00 -2.113531
4 2009-09-30 10:15:00 -2.124716
5 2009-09-30 10:15:00 -2.102117
6 2009-09-30 10:30:00 -2.093542
7 2009-09-30 10:30:00 -2.092626
8 2009-09-30 10:45:00 -2.086339
9 2009-09-30 11:00:00 -2.080144

In Python I would (mis)use a dictionary to achieve what I need:

dict(zip(timestamps, values)).items()

This returns a list of pairs where the first coordinate is unique.

In R I don't know how to do this in a compact and efficient way.


asked Nov 20 '09 09:11

mariotomo

2 Answers

I would use subset combined with duplicated to filter non-unique timestamps in the second data frame:

R> df_ <- read.table(textConnection('
                     ts         v
1 "2009-09-30 10:00:00" -2.081609
2 "2009-09-30 10:15:00" -2.079778
3 "2009-09-30 10:15:00" -2.113531
4 "2009-09-30 10:15:00" -2.124716
5 "2009-09-30 10:15:00" -2.102117
6 "2009-09-30 10:30:00" -2.093542
7 "2009-09-30 10:30:00" -2.092626
8 "2009-09-30 10:45:00" -2.086339
9 "2009-09-30 11:00:00" -2.080144
'), as.is=TRUE, header=TRUE)

R> subset(df_, !duplicated(ts))
                   ts      v
1 2009-09-30 10:00:00 -2.082
2 2009-09-30 10:15:00 -2.080
6 2009-09-30 10:30:00 -2.094
8 2009-09-30 10:45:00 -2.086
9 2009-09-30 11:00:00 -2.080
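A small variation, as a sketch: duplicated() also accepts fromLast = TRUE, which keeps the last row of each group instead of the first. That is closer to the Python dict(zip(...)) idiom from the question, where a later value overwrites an earlier one for the same key. The data frame below is rebuilt from the question's rounded timestamps; the names first_kept and last_kept are only illustrative.

```r
# Rounded timestamps and values from the question's second data frame.
df_ <- data.frame(
  ts = c("2009-09-30 10:00:00", "2009-09-30 10:15:00", "2009-09-30 10:15:00",
         "2009-09-30 10:15:00", "2009-09-30 10:15:00", "2009-09-30 10:30:00",
         "2009-09-30 10:30:00", "2009-09-30 10:45:00", "2009-09-30 11:00:00"),
  v = c(-2.081609, -2.079778, -2.113531, -2.124716, -2.102117,
        -2.093542, -2.092626, -2.086339, -2.080144),
  stringsAsFactors = FALSE)

# Keep the first row seen for each timestamp (rows 1, 2, 6, 8, 9).
first_kept <- df_[!duplicated(df_$ts), ]

# Keep the last row seen for each timestamp (rows 1, 5, 7, 8, 9).
last_kept <- df_[!duplicated(df_$ts, fromLast = TRUE), ]
```

Either way the result has exactly one row per timestamp, which is all the question asks for.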

Update: To select a specific value within each timestamp you can use aggregate:

aggregate(df_$v, by=list(df_$ts), function(x) x[1])  # first value
aggregate(df_$v, by=list(df_$ts), function(x) tail(x, n=1))  # last value
aggregate(df_$v, by=list(df_$ts), function(x) max(x))  # max value
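To tie this back to the assignment in the question, here is a minimal sketch of the whole step in base R. It assumes a 15-minute step, timestamps in UTC, and a hypothetical result frame with a column named v; none of these are stated in the question. It rounds on seconds since the Unix epoch rather than using the question's own epoch variable, and it uses match() instead of %in% so that the rows and the values always line up.

```r
# Raw measurements from the question (time zone UTC is an assumption).
ts <- as.POSIXct(c("2009-09-30 10:00:00", "2009-09-30 10:04:18",
                   "2009-09-30 10:07:47", "2009-09-30 10:09:01",
                   "2009-09-30 10:15:00", "2009-09-30 10:27:56",
                   "2009-09-30 10:30:00", "2009-09-30 10:45:00",
                   "2009-09-30 11:00:00"), tz = "UTC")
v <- c(-2.081609, -2.079778, -2.113531, -2.124716, -2.102117,
       -2.093542, -2.092626, -2.086339, -2.080144)

step_sec <- 15 * 60  # assumed 15-minute step, in seconds

# Round each timestamp up to the next step boundary.
rounded <- as.POSIXct(ceiling(as.numeric(ts) / step_sec) * step_sec,
                      origin = "1970-01-01", tz = "UTC")

# Keep only the first measurement that falls into each step.
keep <- !duplicated(rounded)

# Hypothetical result frame with an evenly spaced timestamp column.
result <- data.frame(
  timestamp = seq(as.POSIXct("2009-09-30 10:00:00", tz = "UTC"),
                  as.POSIXct("2009-09-30 11:00:00", tz = "UTC"),
                  by = step_sec),
  v = NA_real_)

# match() returns the row of each kept timestamp, one row per value.
idx <- match(rounded[keep], result$timestamp)
result[idx, "v"] <- v[keep]
```

With the sample data this keeps measurements 1, 2, 6, 8 and 9, the same rows that subset(df_, !duplicated(ts)) returns above.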

answered Sep 25 '22 14:09

rcs

I think you are looking for a data structure for time-indexed objects rather than a dictionary. For that, look at the zoo and xts packages, which offer much better time-based subsetting:

R> library(xts)
R> X <- xts(data.frame(val=rnorm(10)),
            order.by=Sys.time() + sort(runif(10,10,300)))
R> X
                        val
2009-11-20 07:06:17 -1.5564
2009-11-20 07:06:40 -0.2960
2009-11-20 07:07:50 -0.4123
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47  0.4550
2009-11-20 07:09:57  0.9598
2009-11-20 07:10:11  1.0018
2009-11-20 07:10:12  1.0747
2009-11-20 07:10:58  0.7062
R> X["2009-11-20 07:08::2009-11-20 07:09"]
                        val
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47  0.4550
2009-11-20 07:09:57  0.9598
R>

The X object is ordered by a time sequence. Make sure the index is of type POSIXct, so you may need to parse your dates first. Then we can just index for '7:08 to 7:09 on the given day'.
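For the duplicate timestamps in the question specifically, xts also provides make.index.unique(). The following is only a sketch: it assumes the timestamps have already been rounded to the step as in the question, that UTC is the right time zone, and that the installed xts version supports the drop argument. As I understand it, drop = TRUE keeps the first observation for each repeated index value by default.

```r
library(xts)

# Timestamps already rounded up to the step, as in the question
# (time zone UTC is an assumption).
ts <- as.POSIXct(c("2009-09-30 10:00:00", "2009-09-30 10:15:00",
                   "2009-09-30 10:15:00", "2009-09-30 10:15:00",
                   "2009-09-30 10:15:00", "2009-09-30 10:30:00",
                   "2009-09-30 10:30:00", "2009-09-30 10:45:00",
                   "2009-09-30 11:00:00"), tz = "UTC")
v <- c(-2.081609, -2.079778, -2.113531, -2.124716, -2.102117,
       -2.093542, -2.092626, -2.086339, -2.080144)

# An xts object may hold repeated index values.
X <- xts(data.frame(v = v), order.by = ts)

# drop = TRUE discards observations whose index value is repeated,
# leaving one observation per timestamp.
U <- make.index.unique(X, drop = TRUE)
```

U then holds one observation per time step and can be subset by time in the same way as X above.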

answered Sep 22 '22 14:09

Dirk Eddelbuettel
