I have a data.table of allele identities (rows are individuals, columns are loci), grouped by a separate column. I want to calculate allele frequencies (proportions) for each locus efficiently, by group. An example data table:
DT = data.table(Loc1=rep(c("G","T"),each=5),
Loc2=c("C","A"), Loc3=c("C","G","G","G",
"C","G","G","G","G","G"),
Group=c(rep("G1",3),rep("G2",4),rep("G3",3)))
for(i in 1:3)
set(DT, sample(10,2), i, NA)
> DT
Loc1 Loc2 Loc3 Group
1: G NA C G1
2: G A G G1
3: G C G G1
4: NA NA NA G2
5: G C NA G2
6: T A G G2
7: T C G G2
8: T A G G3
9: T C G G3
10: NA A G G3
The problem I have is that when I try to do calculations by group, only the allele i.d.s present in the group are recognized, so I'm struggling to find code that can tell me e.g. the proportion of G's for locus 1 in all 3 groups. Simple example, calculating a sum (not proportion) for the first allele at each locus:
> fun1<- function(x){sum(na.omit(x==unique(na.omit(x))[1]))}
> DT[,lapply(.SD,fun1),by=Group,.SDcols=1:3]
Group Loc1 Loc2 Loc3
1: G1 3 1 1
2: G2 1 2 2
3: G3 2 2 3
For G1 the result is that Loc1 has 3 G's, but for G3 it shows Loc1 has 2 T's, not the number of G's. I want the number of G's for both in this case. So the key problem is that the allele identities are determined by group, not over the whole column. I tried making a separate table with the allele identities I want to use in calculations, but can't figure out how to include it in fun1 so that the correct cells are referenced in lapply above. Allele identities table:
> fun2<- function(x){sort(na.omit(unique(x)))}
> allele.id<-data.table(DT[,lapply(.SD,fun2),.SDcols=1:3])
> allele.id
Loc1 Loc2 Loc3
1: G A C
2: T C G
It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or making visualisations with ggplot2
for example). With the melt
function of data.table
(which works the same as the melt
function of the reshape2
package) you can transform from wide to long format:
DT2 <- melt(DT, id = "Group", variable.name = "loci")
When you want to remove the NA
-values during the melt-operation, you can add na.rm = TRUE
in the above call (na.rm = FALSE
is the default behaviour).
Then you can make count and proportion variables as follows:
DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]
which gives the following result:
> DT2
Group loci value N prop
1: G1 Loc1 G 3 1.0000000
2: G2 Loc1 NA 1 0.2500000
3: G2 Loc1 G 1 0.2500000
4: G2 Loc1 T 2 0.5000000
5: G3 Loc1 T 2 0.6666667
6: G3 Loc1 NA 1 0.3333333
7: G1 Loc2 NA 1 0.3333333
8: G1 Loc2 A 1 0.3333333
9: G1 Loc2 C 1 0.3333333
10: G2 Loc2 NA 1 0.2500000
11: G2 Loc2 C 2 0.5000000
12: G2 Loc2 A 1 0.2500000
13: G3 Loc2 A 2 0.6666667
14: G3 Loc2 C 1 0.3333333
15: G1 Loc3 C 1 0.3333333
16: G1 Loc3 G 2 0.6666667
17: G2 Loc3 NA 2 0.5000000
18: G2 Loc3 G 2 0.5000000
19: G3 Loc3 G 3 1.0000000
I you want it back in wide format, you can use dcast
on multiple variables:
DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)
which results in:
> DT3
Group loci N_A N_C N_G N_T N_NA prop_A prop_C prop_G prop_T prop_NA
1: G1 Loc1 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2: G1 Loc2 1 1 0 0 1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3: G1 Loc3 0 1 2 0 0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4: G2 Loc1 0 0 1 2 1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5: G2 Loc2 1 2 0 0 1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6: G2 Loc3 0 0 2 0 2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7: G3 Loc1 0 0 0 2 1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8: G3 Loc2 2 1 0 0 0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9: G3 Loc3 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
Another and straightforward approach is using melt
and dcast
in one call (which is a simplified version of the first part of @Frank's answer):
DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)
which gives:
> DT2
Group variable A C G T NA
1: G1 Loc1 0 0 3 0 0
2: G1 Loc2 1 1 0 0 1
3: G1 Loc3 0 1 2 0 0
4: G2 Loc1 0 0 1 2 1
5: G2 Loc2 1 2 0 0 1
6: G2 Loc3 0 0 2 0 2
7: G3 Loc1 0 0 0 2 1
8: G3 Loc2 2 1 0 0 0
9: G3 Loc3 0 0 3 0 0
Because the default aggregation function in dcast
is length
, you will automatically get the counts for each of the values.
Used data:
DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")),
.Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With