I have a dataset that consists of two columns: one consists of a list of names and the other identifies a group to which the corresponding name belongs (example data and code for reproducing example data below).
group name
1 aaa
1 aaa
1 aaa
2 NA
2 NA
3 NA
3 bbb
4 ccc
4 ccc
5 ddd
5 eee
6 fff
6 ffg
> x <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
> y <- c("aaa", "aaa", "aaa", NA, NA, NA, "bbb", "ccc", "ccc", "ddd", "eee", "fff", "ffg")
> mydataframe <- data.frame( "group" = x, "name" = y)
The dataset consists of more than 500,000 rows (as well as other columns which are irrelevant to this question). Groups may include may have one member (one row) or many members (multiple rows). When a group has multiple members, the names of the members may be: a) the same (e.g., group 1 where all members are named "aaa"); b) blank (e.g., group 2 where all members are NA); c) a mix of names and blanks (e.g., group 3 that includes one blank name and one name "bbb"); or d) different names (e.g., group 5 that has "ddd" and "eee"). When there are different names they may be very different (e.g., group 5) or similar (group 6)
I need to filter the dataset to identify rows that fall into category d - those rows that have the same group number but different (non-NA) names (in the example below, groups 5 and 6).
The filter would produce a new dataset that looks looks like this:
group name
5 ddd
5 eee
6 fff
6 ffg
Bonus points if someone can help me not only produce the result above, but also distinguish between groups that have members with totally different names (group 5) and groups that have members with similar, but different names (group 6).
Thank you in advance!
Here's a possible data.table
v >= 1.9.5 solution
library(data.table)
setDT(mydataframe)[, .SD[uniqueN(na.omit(name)) == .N], by = group]
# group name
# 1: 5 ddd
# 2: 5 eee
# 3: 6 fff
# 4: 6 ffg
Basically we are comparing the number of unique groups (while NA
s removed) to the actual group size (.N
) and subsetting it out of .SD
(which stands for Sub Data) which is our actual data set subseted by condition.
Alternatively with dplyr
where we using n_distinct
and n()
instead (for some reason this solution seem not to work without converting to data.table
object first using setDT(mydataframe)
, so if you haven't done so yet, do it before running this code)
library(dplyr)
mydataframe %>%
group_by(group) %>%
filter(n_distinct(na.omit(name)) == n())
# Source: local data table [4 x 2]
# Groups: group
#
# group name
# 1 5 ddd
# 2 5 eee
# 3 6 fff
# 4 6 ffg
For the bonus points you could also try agrep
(you can set the max.distance
as you wish- 1 is the default). 1
- means similar, NA
- means no similarity found.
mydataframe %>%
group_by(group) %>%
filter(n_distinct(na.omit(name)) == n()) %>%
mutate(similarity = agrep(name[1L], name[-1L]))
# Source: local data table [4 x 3]
#
# group name similarity
# 1 5 ddd NA
# 2 5 eee NA
# 3 6 fff 1
# 4 6 ffg 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With