I'd like to create a new dataframe from one that I created earlier. My first dataframe is:
Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2
Then I want to create a dataframe like the one below, with one column for every motif-chromosome combination (96 motifs * 24 chromosomes):
Sample CT-G.A,chr1 TA-C.C,chr1 TC-G.C,chr1 CG-A.T,chr1 CA-G.T,chr1 CT-G.A,chr2 TA-C.C,chr2 TC-G.C,chr2 CG-A.T,chr2 CA-G.T,chr2
1 1 1 0 0 0 0 0 1 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 1 1
Here is a possible solution using dplyr and tidyr.
We add a column value that marks each observed row with 1, then complete the data.frame so that there is a row for every motif-chromosome-Sample combination, with missing combinations getting a 0 in the value column. We create a key out of the motif and chromosome columns and sum value per Sample and key, which also discards the original motif and chromosome columns. Lastly, we reshape the data.frame from long to wide to get your desired format. Note that value is a count, so a combination that appears twice in the input shows up as 2. Hope this helps!
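library(tidyr)
library(dplyr)

# Example data; the last row is duplicated, so that combination is counted twice
df <- read.table(text = "Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2
2 CA-G.T 2", header = TRUE)

df %>%
  mutate(value = 1) %>%                                            # mark observed rows
  complete(motif, chromosome, Sample, fill = list(value = 0)) %>%  # add missing combinations as 0
  mutate(key = paste0(motif, ",chr", chromosome)) %>%              # build the column label
  group_by(Sample, key) %>%
  summarize(value = sum(value)) %>%                                # count per Sample and key
  spread(key, value) %>%                                           # long to wide
  as.data.frame()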
Output:
Sample CA-G.T,chr1 CA-G.T,chr2 CG-A.T,chr1 CG-A.T,chr2 CT-G.A,chr1 CT-G.A,chr2 TA-C.C,chr1 TA-C.C,chr2 TC-G.C,chr1 TC-G.C,chr2
1 1 0 0 0 0 1 0 1 0 0 1
2 2 0 2 0 1 0 0 0 0 0 0
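In recent tidyr versions spread() is superseded by pivot_wider(), so here is a minimal sketch of the same idea with the newer function. It assumes tidyr >= 1.0 and dplyr >= 1.0; the names wide and present are only illustrative. Since the table in the question looks like 0/1 indicators rather than counts, the last step shows one possible way to turn the counts into presence/absence.

```r
library(dplyr)
library(tidyr)

# Same example data as above (last row duplicated)
df <- read.table(text = "Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2
2 CA-G.T 2", header = TRUE)

wide <- df %>%
  count(Sample, motif, chromosome, name = "value") %>%               # occurrences per combination
  complete(Sample, motif, chromosome, fill = list(value = 0L)) %>%   # unobserved combinations get 0
  mutate(key = paste0(motif, ",chr", chromosome)) %>%                # build the column label
  select(Sample, key, value) %>%
  pivot_wider(names_from = key, values_from = value) %>%             # long to wide
  as.data.frame()

# Assumption: the question wants indicators, so collapse counts to 0/1
present <- wide %>%
  mutate(across(-Sample, ~ as.integer(.x > 0)))
```

Like the original answer, complete() only fills in motifs and chromosomes that occur somewhere in df. To get all 96 * 24 columns, motif and chromosome would first have to be converted to factors carrying the full set of levels, which is not shown here because the full motif list is not given in the question.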