I have a huge dataset (more than 2 million rows of over 100 variables; below is a small sample). For each subj_trial group, I want to find the first occurrence of each unique value containing ".wav" in message. It should match on containing ".wav", not just ending in it (i.e. *.wav), because some rows have a bunch of other information in the message field (not pictured in the example, sorry).
It would be OK to output a data.frame that only had those three columns, but it's not necessary. I will later need to use the timestamp
column for analyses.
I've found this question: Extract rows for the first occurrence of a variable in a data frame, but for the life of me I cannot work that example to fit mine.
Here's some sample data:
subj_trial message timestamp
1 1_1 message 459 755616
2 1_1 . 755618
3 1_1 test1.wav 755662
4 1_1 . 765712
5 1_1 test1.wav 767918
6 1_2 . 769342
7 1_2 test2.wav 775662
8 1_2 . 786412
9 1_2 test2.wav 797460
10 1_2 . 807626
11 1_3 test3.wav 817794
12 1_3 warning 11 827960
13 2_1 message 481 817313
14 2_1 test1.wav 817347
15 2_1 . 834959
16 2_1 test1.wav 855007
17 2_1 . 880107
18 2_2 . 895723
19 2_2 test2.wav 922671
20 2_2 . 958003
21 2_2 test2.wav 994385
22 2_3 . 1016217
23 2_3 test3.wav 1036899
24 2_3 . 1047331
25 2_3 test3.wav 1142527
This is a very small example of what I'm dealing with here. For each subj_trial group there are probably 3000 lines, and there are over 700 groups.
Here's an example of what I'd like to have.
subj_trial message timestamp
1 1_1 test1.wav 755662
2 1_2 test2.wav 775662
3 1_3 test3.wav 817794
4 2_1 test1.wav 817347
5 2_2 test2.wav 922671
6 2_3 test3.wav 1036899
I've figured out how to get the unique values in message over the entire dataset by doing this:
unique_message <- df[match(unique(df$message), df$message),]
But I can't figure out how to do it by group. I've also tried using group_by in the dplyr package but can't get that to work, either. Have mercy and show me the way, friends. Thanks!
Here is a dplyr solution as well, if you are interested:
library(dplyr)
dat %>%
filter(grepl("\\.wav", message)) %>%
group_by(subj_trial) %>%
top_n(n=1, wt=desc(timestamp))
First, filter the data to just those rows containing .wav in the message column. Then group the data by subj_trial and return, for each group, the row with the smallest timestamp (that is what wt=desc(timestamp) selects). This assumes you want the smallest timestamp, not necessarily the first one in the data set (i.e. if a record with a larger timestamp came first, it would NOT be returned). It wasn't clear to me which you were looking for, but perhaps in your case there is no difference.
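To make the distinction between "first in the data" and "smallest timestamp" concrete, here is a minimal sketch in base R (chosen only so that it runs without extra packages; it is not the dplyr code above). The toy data frame and its values are made up for illustration.

```r
# Hypothetical toy data: in group "1_1" the smaller timestamp appears second.
toy <- data.frame(
  subj_trial = c("1_1", "1_1", "1_2"),
  message    = c("a.wav", "b.wav", "c.wav"),
  timestamp  = c(200, 100, 300),
  stringsAsFactors = FALSE
)

# Reading 1: the first row of each group, in the order the data is stored.
first_in_order <- toy[!duplicated(toy$subj_trial), ]

# Reading 2: the row with the smallest timestamp in each group.
# Sort by group and timestamp, then keep the first row of each group.
sorted      <- toy[order(toy$subj_trial, toy$timestamp), ]
smallest_ts <- sorted[!duplicated(sorted$subj_trial), ]

first_in_order$message  # "a.wav" "c.wav"
smallest_ts$message     # "b.wav" "c.wav"
```

If the rows are already sorted by timestamp within each group, as they appear to be in the sample data, both readings give the same result.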
And since I'm always curious about the efficiency differences between data.table and dplyr approaches, I did a microbenchmark test. It looks like in this case, data.table has a slight speed advantage:
library(microbenchmark)
library(data.table)
library(dplyr)
set.seed(1)
dat <- data.frame(subj_trial=paste0(sample(1:20,1e6,replace=TRUE),"_",sample(1:20,1e6,replace=TRUE)),
message=sample(c(".wav","others"), 1e6, replace=TRUE),
timestamp=round(seq(from=1000, to=9142527, length.out = 1e6)))
dat2 <- dat
setDT(dat2)
microbenchmark({dat %>%
filter(grepl("\\.wav", message)) %>%
group_by(subj_trial) %>%
top_n(1, wt=desc(timestamp))},
{dat2[grepl("\\.wav", message), .SD[1], by=subj_trial]})
Unit: milliseconds
expr
dat %>% filter(grepl("\\\\.wav", message)) %>% group_by(subj_trial) %>% top_n(1, wt = desc(timestamp))
dat2[grepl("\\\\.wav", message), .SD[1], by = subj_trial]
min lq mean median uq max neval cld
332.9693 357.7426 387.2245 367.6443 380.9935 637.9223 100 b
263.0292 272.8627 293.4976 281.4568 285.7699 582.9954 100 a
Also using data.table, but with a more concise formulation:
setDT(DT)
DT[,.SD[grep("\\.wav",message)[1]],by=subj_trial]
Edit: As suggested by a comment below,
DT[grepl("\\.wav", message), .SD[1], by=subj_trial]
might be even faster, since it uses logical indexing and data.table's optimized i subsetting.
.SD is a data.table containing the Subset of DT's Data for each group, excluding any columns used in by (or keyby).
by is a bit like the GROUP BY operator in SQL. It designates the grouping column(s).
grep(pattern, x) returns the indices of all matches for the pattern in x, where x is a vector. The \\ before .wav prevents grep from treating . as a special character (in grep's parsing, an unescaped . means 'any character').
vector_name[1] returns the first element of a vector called vector_name. It can be applied to the result of a function, such as grep above.
The data.table formula is DT[i, j, by]: i is the subset or join, j is the operation to be performed, and by is the grouping element. In our case, i is omitted (hence the leading ,) since we want to work on the full set, j is the operation on all .SD columns, and by is the column you want your results grouped by.
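The pattern-matching pieces described above can be tried on their own in base R. This is a small sketch; the msgs vector is invented for illustration and is not taken from the question's data.

```r
# Hypothetical messages, including one with extra text around the file name.
msgs <- c("message 459", ".", "test1.wav", "x wav y", "info test1.wav more")

# Escaped dot: matches a literal ".wav" anywhere in the string.
idx_escaped <- grep("\\.wav", msgs)      # 3 5

# Unescaped dot: "." matches any character, so " wav" also matches.
idx_unescaped <- grep(".wav", msgs)      # 3 4 5

# grepl() returns a logical vector instead of indices.
is_wav <- grepl("\\.wav", msgs)          # FALSE FALSE TRUE FALSE TRUE

# [1] picks the first match; with no match at all it yields NA.
first_match <- grep("\\.wav", msgs)[1]           # 3
no_match    <- grep("\\.wav", c("a", "b"))[1]    # NA
```

The last line is worth keeping in mind: when a group contains no ".wav" message, grep(...)[1] is NA rather than an empty result.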
Using data.table:
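library(data.table)
setDT(DT)
DT[, {
  # position of the first row in this group whose message contains ".wav"
  id <- head(grep("\\.wav", message), 1)
  # keep that row's message and timestamp (groups with no match are dropped)
  list(message = message[id], timestamp = timestamp[id])
}, by = subj_trial]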
# subj_trial message timestamp
# 1: 1_1 test1.wav 755662
# 2: 1_2 test2.wav 775662
# 3: 1_3 test3.wav 817794
# 4: 2_1 test1.wav 817347
# 5: 2_2 test2.wav 922671
# 6: 2_3 test3.wav 1036899
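For comparison, the same result can be sketched in base R without any packages. This is not one of the approaches proposed above, just an equivalent under the assumption that "first" means first in row order; the data frame df is rebuilt here from the sample data in the question.

```r
# Sample data from the question, rebuilt by hand.
df <- data.frame(
  subj_trial = rep(c("1_1", "1_2", "1_3", "2_1", "2_2", "2_3"),
                   times = c(5, 5, 2, 5, 4, 4)),
  message = c("message 459", ".", "test1.wav", ".", "test1.wav",
              ".", "test2.wav", ".", "test2.wav", ".",
              "test3.wav", "warning 11",
              "message 481", "test1.wav", ".", "test1.wav", ".",
              ".", "test2.wav", ".", "test2.wav",
              ".", "test3.wav", ".", "test3.wav"),
  timestamp = c(755616, 755618, 755662, 765712, 767918,
                769342, 775662, 786412, 797460, 807626,
                817794, 827960,
                817313, 817347, 834959, 855007, 880107,
                895723, 922671, 958003, 994385,
                1016217, 1036899, 1047331, 1142527),
  stringsAsFactors = FALSE
)

# Keep only rows whose message contains a literal ".wav".
wav <- df[grepl("\\.wav", df$message), ]

# Keep the first remaining row of each subj_trial group (row order).
first_wav <- wav[!duplicated(wav$subj_trial), ]
rownames(first_wav) <- NULL
first_wav
```

If a group can contain several distinct .wav messages and the first occurrence of each one is wanted, replacing wav$subj_trial with wav[c("subj_trial", "message")] inside duplicated() should keep one row per group and message.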