
Remove rows of a data set belonging to a factor of specified length

Tags:

r

I have a data.frame similar to the following:

df <- data.frame(population = c("AA","AA","AA","BB","BB","CC","CC","CC"),
                 individual = c("A1","A2","A3","B1","B2","C1","C2","C3"),
                 Haplotype1 = rep(1:4,2),
                 Haplotype2 = rep(5:8,2))
 > df
  population individual Haplotype1 Haplotype2
1         AA         A1          1          5
2         AA         A2          2          6
3         AA         A3          3          7
4         BB         B1          4          8
5         BB         B2          1          5
6         CC         C1          2          6
7         CC         C2          3          7
8         CC         C3          4          8

I want to create a new dataset that omits any population with fewer than a specified number of individuals. For example, I want to reanalyze the data for only populations with three or more individuals. The following is the dataset I want:

> df <- df[!df$population=="BB",]
> df
  population individual Haplotype1 Haplotype2
1         AA         A1          1          5
2         AA         A2          2          6
3         AA         A3          3          7
6         CC         C1          2          6
7         CC         C2          3          7
8         CC         C3          4          8

However, I have 400 populations ranging in size from 5 to 155 individuals, and manually picking populations out by name is not feasible. I want to write a function where I say, in essence, "give me a dataset with all populations consisting of X individuals or more, and delete those with fewer than X." Any help or feedback is appreciated.

user1774225 asked Oct 25 '12 13:10


3 Answers

This should do the trick:

tab <- table(df$population) > 2
df[df$population %in% names(tab)[tab], ]

#   population individual Haplotype1 Haplotype2
# 1         AA         A1          1          5
# 2         AA         A2          2          6
# 3         AA         A3          3          7
# 6         CC         C1          2          6
# 7         CC         C2          3          7
# 8         CC         C3          4          8
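
Since the question asks for a function with an adjustable cutoff, the same idea can be wrapped up with the threshold as an argument. This is only a minimal sketch: the name keep_large_groups and its arguments are made up for illustration, and it counts rows per group (not unique individuals).

```r
# Hypothetical helper (name and arguments are illustrative, not from the answer):
# keep only rows whose group has at least `min_n` rows.
keep_large_groups <- function(data, group_col, min_n) {
  counts <- table(data[[group_col]])            # number of rows per group
  keep <- names(counts)[counts >= min_n]        # groups that meet the threshold
  data[data[[group_col]] %in% keep, , drop = FALSE]
}

# Example data from the question
df <- data.frame(population = c("AA","AA","AA","BB","BB","CC","CC","CC"),
                 individual = c("A1","A2","A3","B1","B2","C1","C2","C3"),
                 Haplotype1 = rep(1:4, 2),
                 Haplotype2 = rep(5:8, 2))

res <- keep_large_groups(df, "population", 3)   # drops population "BB"
```

With min_n = 3 this should return the same six rows as the two-liner above.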
Sven Hohenstein answered Nov 15 '22 07:11


The most direct approach I can think of is to use data.table() from the "data.table" package:

library(data.table)
DT <- data.table(population = c("AA","AA","AA","BB","BB","CC","CC","CC"),
                 individual = c("A1","A2","A3","B1","B2","C1","C2","C3"),
                 Haplotype1 = rep(1:4,2), Haplotype2 = rep(5:8,2),
                 key = "population")
## Or, convert your existing data.frame "df" to data.table:
## DT <- data.table(df, key = "population")
DT[, .SD[length(unique(individual)) >= 3], by = key(DT)]
#    population individual Haplotype1 Haplotype2
# 1:         AA         A1          1          5
# 2:         AA         A2          2          6
# 3:         AA         A3          3          7
# 4:         CC         C1          2          6
# 5:         CC         C2          3          7
# 6:         CC         C3          4          8

Update

I'm not sure if this is important to you or not, but note that with Tyler's and Sven's current solutions, although the output is correct according to the data in the question you've posted, there is actually some potentially flawed thinking going on.

I write "potentially" because you mention that you're looking for groups (from df$population) with three or more individuals (from df$individual). However, both of their solutions currently count only the rows in each population, whereas from your question I would have assumed that you want the number of unique individuals within each population.

Here's a simple example. Using your original "df", change the individual in row 3 to "A2" (df[3, 2] <- "A2"). Now, according to your criteria in your question, only rows with population == "CC" should be returned.

If your data already only has unique individuals, then no problem--but I thought I would mention it ;)
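
To make the difference concrete, here is a small sketch that compares the two counts on the modified data. The copy df_dup is a name introduced here for illustration, and stringsAsFactors = FALSE is an assumption added so the assignment behaves the same across R versions.

```r
# Example data from the question (stringsAsFactors = FALSE is an added assumption)
df <- data.frame(population = c("AA","AA","AA","BB","BB","CC","CC","CC"),
                 individual = c("A1","A2","A3","B1","B2","C1","C2","C3"),
                 Haplotype1 = rep(1:4, 2),
                 Haplotype2 = rep(5:8, 2),
                 stringsAsFactors = FALSE)

df_dup <- df                # work on a copy so the original stays untouched
df_dup[3, 2] <- "A2"        # "A2" now appears twice in population "AA"

# Rows per population: AA = 3, BB = 2, CC = 3
rows_per_pop <- table(df_dup$population)

# Unique individuals per population: AA = 2, BB = 2, CC = 3
unique_per_pop <- tapply(df_dup$individual, df_dup$population,
                         function(x) length(unique(x)))
```

A filter based on rows_per_pop would still keep "AA", while one based on unique_per_pop would not.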


A base R solution that takes this logic into account is:

uniqueIndividuals <- ave(as.character(df$individual), 
                         df$population, FUN = function(x) length(unique(x)))
df[which(as.numeric(uniqueIndividuals) >= 3), ]
A5C1D2H2I1M1N2O1R2T1 answered Nov 15 '22 07:11


This would work as well:

lens <- tapply(df$population , df$population, length)
df[df$population %in% names(lens)[lens > 2], ]

EDIT: Per mrdwab's sharp reading I have edited my answer. I must admit I looked at the input and output only:

lens <- tapply(df$individual, df$population, function(x) length(unique(x)))
df[df$population %in% names(lens)[lens > 2], ]
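
For the "give me populations with X or more individuals" function the question asks for, the edited version could be wrapped roughly like this. It is a sketch only: filter_populations and min_n are names invented here, and the column names population and individual are assumed to match the question's data.

```r
# Hypothetical helper (not part of the original answer): keep populations
# that contain at least `min_n` distinct individuals.
filter_populations <- function(data, min_n) {
  n_unique <- tapply(as.character(data$individual), data$population,
                     function(x) length(unique(x)))
  # which() drops NA counts that tapply returns for empty factor levels
  keep <- names(n_unique)[which(n_unique >= min_n)]
  data[data$population %in% keep, , drop = FALSE]
}

# Question's data, with row 3 changed so "AA" has only two distinct individuals
df_dup <- data.frame(population = c("AA","AA","AA","BB","BB","CC","CC","CC"),
                     individual = c("A1","A2","A2","B1","B2","C1","C2","C3"),
                     Haplotype1 = rep(1:4, 2),
                     Haplotype2 = rep(5:8, 2))

res <- filter_populations(df_dup, 3)   # only the "CC" rows remain
```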
Tyler Rinker answered Nov 15 '22 09:11