I have a data frame that looks something like this (with a lot more observations)
df <- structure(list(session_user_id = c("1803f6c3625c397afb4619804861f75268dfc567",
"1924cb2ebdf29f052187b9a2d21673e4d314199b", "1924cb2ebdf29f052187b9a2d21673e4d314199b",
"1924cb2ebdf29f052187b9a2d21673e4d314199b", "1924cb2ebdf29f052187b9a2d21673e4d314199b",
"198b83b365fef0ed637576fe1bde786fc09817b2", "19fd8069c094fb0697508cc9646513596bea30c4",
"19fd8069c094fb0697508cc9646513596bea30c4", "19fd8069c094fb0697508cc9646513596bea30c4",
"19fd8069c094fb0697508cc9646513596bea30c4", "1a3d33c9cbb2aa41515e6ef76f123b2ea8ee2f13",
"1b64c142b1540c43e3f813ccec09cb2dd7907c14", "1b7346d13f714c97725ba2e1c21b600535164291"
), raw_score = c(1, 1, 1, 1, 1, 0.2, NA, 1, 1, 1, 1, 0.2, 1),
submission_time = c(1389707078L, 1389694184L, 1389694188L,
1389694189L, 1389694194L, 1390115495L, 1389696939L, 1389696971L,
1389741306L, 1389985033L, 1389983862L, 1389854836L, 1389692240L
)), .Names = c("session_user_id", "raw_score", "submission_time"
), row.names = 28:40, class = "data.frame")
I want to create a new data frame with only one observation per "session_ user_id" by keeping the one with the latest "submission_time."
The only idea that I have in mind is to create a list of unique users. Write a loop to find the max of submission_time for each user and then write a loop that gets raw score fore that user and time.
Can somebody show me a better way of doing this in R?
Thanks!
To find unique values in a column in a data frame, use the unique() function in R. In Exploratory Data Analysis, the unique() function is crucial since it detects and eliminates duplicate values in the data.
The last n rows of the data frame can be accessed by using the in-built tail() method in R. Supposedly, N is the total number of rows in the data frame, then n <=N last rows can be extracted from the structure.
Use the dplyr filter function to get the first and the last row of each group. This is a combination of duplicates removal that leaves the first and last row at the same time.
You could first order your data.frame
by submission_time
and remove all duplicated session_user_id
entries afterwards:
## order by submission_time
df <- df[order(df$submission_time, decreasing=TRUE),]
## remove duplicated user_id
df <- df[!duplicated(df$session_user_id),]
# session_user_id raw_score submission_time
#33 198b83b365fef0ed637576fe1bde786fc09817b2 0.2 1390115495
#37 19fd8069c094fb0697508cc9646513596bea30c4 1.0 1389985033
#38 1a3d33c9cbb2aa41515e6ef76f123b2ea8ee2f13 1.0 1389983862
#39 1b64c142b1540c43e3f813ccec09cb2dd7907c14 0.2 1389854836
#28 1803f6c3625c397afb4619804861f75268dfc567 1.0 1389707078
#32 1924cb2ebdf29f052187b9a2d21673e4d314199b 1.0 1389694194
#40 1b7346d13f714c97725ba2e1c21b600535164291 1.0 1389692240
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With