The data I'm importing describes numeric measurements taken at various locations at more or less evenly spaced timestamps. Sometimes this "evenly spaced" is not really true and I have to discard some of the values; it's not that important which one, as long as I end up with one value for each timestamp for each location.
What do I do with the data? I add it to a result data.frame. There I have a timestamp column, and the values in that column are definitely evenly spaced according to the step:
# round each timestamp up to the next multiple of `step` minutes past `epoch`
timestamps <- ceiling(as.numeric((timestamps-epoch)*24*60/step))*step*60 + epoch
# fill the matching rows of the result frame with the new values
result[result$timestamp %in% timestamps, columnName] <- values
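To make that concrete, here is a self-contained toy version of the rounding (the real epoch and step are defined elsewhere in my import code; using difftime with explicit units keeps the day-based arithmetic unambiguous):
# toy values, not from my real data
epoch <- as.POSIXct("2009-09-30 00:00:00", tz="UTC")
step  <- 15   # step size in minutes
ts    <- as.POSIXct(c("2009-09-30 10:04:18", "2009-09-30 10:07:47"), tz="UTC")
# minutes since epoch, rounded up to a whole number of steps, back to POSIXct
mins  <- as.numeric(difftime(ts, epoch, units="days")) * 24 * 60
ceiling(mins / step) * step * 60 + epoch   # both map to 2009-09-30 10:15:00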
This does NOT work when several timestamps fall into the same time step. Here is an example:
> data.frame(ts=timestamps, v=values)
ts v
1 2009-09-30 10:00:00 -2.081609
2 2009-09-30 10:04:18 -2.079778
3 2009-09-30 10:07:47 -2.113531
4 2009-09-30 10:09:01 -2.124716
5 2009-09-30 10:15:00 -2.102117
6 2009-09-30 10:27:56 -2.093542
7 2009-09-30 10:30:00 -2.092626
8 2009-09-30 10:45:00 -2.086339
9 2009-09-30 11:00:00 -2.080144
> data.frame(ts=ceiling(as.numeric((timestamps-epoch)*24*60/step))*step*60+epoch,
+ v=values)
ts v
1 2009-09-30 10:00:00 -2.081609
2 2009-09-30 10:15:00 -2.079778
3 2009-09-30 10:15:00 -2.113531
4 2009-09-30 10:15:00 -2.124716
5 2009-09-30 10:15:00 -2.102117
6 2009-09-30 10:30:00 -2.093542
7 2009-09-30 10:30:00 -2.092626
8 2009-09-30 10:45:00 -2.086339
9 2009-09-30 11:00:00 -2.080144
In Python I would (mis)use a dictionary to achieve what I need:
dict(zip(timestamps, values)).items()
returns a list of pairs whose first coordinates are unique. In R, I don't know how to do this in a compact and efficient way.
I would use subset combined with duplicated to filter non-unique timestamps in the second data frame:
R> df_ <- read.table(textConnection('
ts v
1 "2009-09-30 10:00:00" -2.081609
2 "2009-09-30 10:15:00" -2.079778
3 "2009-09-30 10:15:00" -2.113531
4 "2009-09-30 10:15:00" -2.124716
5 "2009-09-30 10:15:00" -2.102117
6 "2009-09-30 10:30:00" -2.093542
7 "2009-09-30 10:30:00" -2.092626
8 "2009-09-30 10:45:00" -2.086339
9 "2009-09-30 11:00:00" -2.080144
'), as.is=TRUE, header=TRUE)
R> subset(df_, !duplicated(ts))
ts v
1 2009-09-30 10:00:00 -2.082
2 2009-09-30 10:15:00 -2.080
6 2009-09-30 10:30:00 -2.094
8 2009-09-30 10:45:00 -2.086
9 2009-09-30 11:00:00 -2.080
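Note that duplicated keeps the first row within each time step; pass fromLast=TRUE if you would rather keep the last one:
R> subset(df_, !duplicated(ts, fromLast=TRUE))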
Update: To select a specific value per time step, you can use aggregate:
aggregate(df_$v, by=list(df_$ts), function(x) x[1]) # first value
aggregate(df_$v, by=list(df_$ts), function(x) tail(x, n=1)) # last value
aggregate(df_$v, by=list(df_$ts), function(x) max(x)) # max value
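aggregate returns the key column as Group.1 and the values as x, so you may want to rename the result's columns. And to mimic the Python dict semantics (one surviving value per key) even more directly, tapply gives you a named vector; a minimal sketch against the same df_:
R> v <- tapply(df_$v, df_$ts, function(x) tail(x, n=1))  # last value wins, as in dict()
R> data.frame(ts=names(v), v=as.vector(v))               # back to a two-column frame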
I think you are looking for data structures for time-indexed objects, not for a dictionary. For the former, look at the zoo and xts packages, which offer much better time-based subsetting:
R> library(xts)
R> X <- xts(data.frame(val=rnorm(10)),
+           order.by=Sys.time() + sort(runif(10, 10, 300)))
R> X
val
2009-11-20 07:06:17 -1.5564
2009-11-20 07:06:40 -0.2960
2009-11-20 07:07:50 -0.4123
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47 0.4550
2009-11-20 07:09:57 0.9598
2009-11-20 07:10:11 1.0018
2009-11-20 07:10:12 1.0747
2009-11-20 07:10:58 0.7062
R> X["2009-11-20 07:08::2009-11-20 07:09"]
val
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47 0.4550
2009-11-20 07:09:57 0.9598
R>
The X object is ordered by a time sequence -- make sure it is of type POSIXct, so you may need to parse your dates first. Then we can just index for '7:08 to 7:09 on the given day'.
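If you stay with xts, the package also has helpers that match the original problem quite directly; a sketch, assuming 15-minute steps (see ?align.time for the details in your version):
R> Y <- align.time(X, n=15*60)    # push each index up to the next 15-minute boundary
R> Y[!duplicated(index(Y)), ]     # keep the first observation per aligned timestamp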