Spread multiple coordinated columns

Question

I have relatively tidy data with samples, genes, alleles, and frequencies in separate columns. For each gene and for each sample I need to split out the alleles and their corresponding frequency into separate columns. Here's what I have and what I need.

Trying to do this with dplyr/tidyr, but I'll take any solution I can get.

What I have:

data.frame(sample=rep("sample1", 10), 
           gene=rep(paste0("gene", 1:5), each=2), 
           allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"), 
           freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

#     sample  gene allele freq
# 1  sample1 gene1      A  0.9
# 2  sample1 gene1      G  0.1
# 3  sample1 gene2      A  0.8
# 4  sample1 gene2      C  0.2
# 5  sample1 gene3      A  0.7
# 6  sample1 gene3      T  0.3
# 7  sample1 gene4      C  0.6
# 8  sample1 gene4      G  0.4
# 9  sample1 gene5      G  0.5
# 10 sample1 gene5      T  0.5

What I want:

data.frame(sample=rep("sample1", 5), 
           gene=paste0("gene", 1:5), 
           allele1=c("A", "A", "A", "C", "G"), 
           allele2=c("G", "C", "T", "G", "T"), 
           freq1=c(.9, .8, .7, .6, .5), 
           freq2=c(.1, .2, .3, .4, .5))

#    sample  gene allele1 allele2 freq1 freq2
# 1 sample1 gene1       A       G   0.9   0.1
# 2 sample1 gene2       A       C   0.8   0.2
# 3 sample1 gene3       A       T   0.7   0.3
# 4 sample1 gene4       C       G   0.6   0.4
# 5 sample1 gene5       G       T   0.5   0.5

akrun · Accepted Answer

You can use dcast from the devel version of data.tableie. 1.9.5+ which can take multiple value.var columns. We create a sequence column ('indx') grouped by 'sample' and 'gene'. Then dcast from long to wide format mentioning the value.var columns.

 library(data.table)#v1.9.5+ 
 setDT(df)[, indx:=1:.N,.(sample, gene)]
 dcast(df, sample+gene~indx, value.var=c('allele', 'freq'), sep= '')
 #    sample  gene   allele1 allele2  freq1 freq2
 #1: sample1 gene1        A        G    0.9    0.1
 #2: sample1 gene2        A        C    0.8    0.2
 #3: sample1 gene3        A        T    0.7    0.3
 #4: sample1 gene4        C        G    0.6    0.4
 #5: sample1 gene5        G        T    0.5    0.5

NOTE: Instructions to install the devel version are here

The sep='' argument is useful to create the column names as 'allele1', 'allele2' etc. The default would be `allele_1', 'allele_2' etc. (from @Arun's comments)

Spread multiple coordinated columns

Tags:

r

Stephen Turner

1 Answers

akrun

Recent Activity

Donate For Us

Spread multiple coordinated columns

Tags:

r

Stephen Turner

1 Answers

akrun

Related questions

Recent Activity

Donate For Us