Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filtering rows with specific characteristics relative to other rows

Tags:

r

I have a dataset that consists of two columns: one consists of a list of names and the other identifies a group to which the corresponding name belongs (example data and code for reproducing example data below).

group   name
1   aaa
1   aaa
1   aaa
2   NA
2   NA
3   NA
3   bbb
4   ccc
4   ccc
5   ddd
5   eee
6   fff
6   ffg

> x <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
> y <- c("aaa", "aaa", "aaa", NA, NA, NA, "bbb", "ccc", "ccc", "ddd", "eee", "fff", "ffg")
> mydataframe <- data.frame( "group" = x, "name" = y)

The dataset consists of more than 500,000 rows (as well as other columns which are irrelevant to this question). Groups may include may have one member (one row) or many members (multiple rows). When a group has multiple members, the names of the members may be: a) the same (e.g., group 1 where all members are named "aaa"); b) blank (e.g., group 2 where all members are NA); c) a mix of names and blanks (e.g., group 3 that includes one blank name and one name "bbb"); or d) different names (e.g., group 5 that has "ddd" and "eee"). When there are different names they may be very different (e.g., group 5) or similar (group 6)

I need to filter the dataset to identify rows that fall into category d - those rows that have the same group number but different (non-NA) names (in the example below, groups 5 and 6).

The filter would produce a new dataset that looks looks like this:

group   name
5   ddd
5   eee
6   fff
6   ffg

Bonus points if someone can help me not only produce the result above, but also distinguish between groups that have members with totally different names (group 5) and groups that have members with similar, but different names (group 6).

Thank you in advance!

like image 271
David S Avatar asked Feb 10 '23 06:02

David S


1 Answers

Here's a possible data.table v >= 1.9.5 solution

library(data.table)
setDT(mydataframe)[, .SD[uniqueN(na.omit(name)) == .N], by = group]
#    group name
# 1:     5  ddd
# 2:     5  eee
# 3:     6  fff
# 4:     6  ffg

Basically we are comparing the number of unique groups (while NAs removed) to the actual group size (.N) and subsetting it out of .SD (which stands for Sub Data) which is our actual data set subseted by condition.


Alternatively with dplyr where we using n_distinct and n() instead (for some reason this solution seem not to work without converting to data.table object first using setDT(mydataframe), so if you haven't done so yet, do it before running this code)

library(dplyr)
mydataframe %>%
  group_by(group) %>%
  filter(n_distinct(na.omit(name)) == n())

# Source: local data table [4 x 2]
# Groups: group
# 
#   group name
# 1     5  ddd
# 2     5  eee
# 3     6  fff
# 4     6  ffg

For the bonus points you could also try agrep (you can set the max.distance as you wish- 1 is the default). 1- means similar, NA- means no similarity found.

mydataframe %>%
  group_by(group) %>%
  filter(n_distinct(na.omit(name)) == n()) %>%
  mutate(similarity = agrep(name[1L], name[-1L]))

# Source: local data table [4 x 3]
# 
#   group name similarity
# 1     5  ddd         NA
# 2     5  eee         NA
# 3     6  fff          1
# 4     6  ffg          1
like image 147
David Arenburg Avatar answered Feb 13 '23 03:02

David Arenburg