transform a dataframe of frequencies to a wider format

I have a dataframe that looks like this.

input dataframe

position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2

These data are nucleotide differences at given positions in a hypothetical genome. mean_freq is the frequency of the alternative nucleotide relative to the reference, so the first row means the proportion of C is 0.002, implying that A is at 0.998.

I want to transform this into a wider structure with one column per nucleotide, such that:

desired_output

position,G,C,T,A,sample_id
1,0,0.002,0,0.998,name1
2,0.96,0,0.04,0,name1
3,0,0.03,0,0.97,name2

I have attempted this approach

per_position_full_nt_freq <- function(x){
  df <- data.frame(A=0, C=0, G=0, T=0)
  idx <- names(df) %in% x$alternative
  df[,idx] <- x$mean_freq
  idx2 <- names(df) %in% x$reference 
  df[,idx2] <- 1 - x$mean_freq
  df$position <- x$position
  df$sampleName <- x$sampleName
  return(df)
}

desired_output_dataframe <- per_position_full_nt_freq(input_dataframe)

I ran into an error

In matrix(value, n, p) :
  data length [8905] is not a sub-multiple or multiple of the number of columns

Additionally, I feel there has to be a more intuitive solution, presumably using tidyr or dplyr. How do I conveniently transform the input dataframe into the desired output format?

Thank you.

One option would be to create a matrix of 0's with the 'G', 'C', 'T', 'A' column names, match the 'alternative' and 'reference' values of the original dataset against those column names to get column indices, use the row/column index pairs to assign the values, and then cbind the result with the original dataset's 'position' and 'sample_id' columns.

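# df1 is the input dataframe; start from an all-zero matrix with one column per nucleotide
m1 <- matrix(0, ncol=4, nrow=nrow(df1), dimnames = list(NULL, c("G", "C", "T", "A")))
# alternative nucleotide: write mean_freq at each (row, matched column) pair
m1[cbind(seq_len(nrow(df1)), match(df1$alternative, colnames(m1)))]  <-  df1$mean_freq
# reference nucleotide: the remaining frequency, 1 - mean_freq
m1[cbind(seq_len(nrow(df1)), match(df1$reference, colnames(m1)))]  <-  1 - df1$mean_freq
cbind(df1['position'], m1, df1['sample_id'])
#   position    G     C    T     A sample_id
#1        1 0.00 0.002 0.00 0.998     name1
#2        2 0.96 0.000 0.04 0.000     name1
#3        3 0.00 0.030 0.00 0.970     name2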
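
Since the question asks about tidyr or dplyr, here is a rough sketch of one way the same reshape might be written with those packages. It is not part of the answer above. It assumes tidyr 1.2.0 or later (for the names_expand argument) and that reference and alternative differ in every row. The df1 construction simply retypes the question's sample rows; ref_freq, allele, nt, freq and wide are names made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Sample rows retyped from the question (assumed to be what df1 holds)
df1 <- data.frame(
  position    = 1:3,
  mean_freq   = c(0.002, 0.04, 0.03),
  reference   = c("A", "G", "A"),
  alternative = c("C", "T", "C"),
  sample_id   = c("name1", "name1", "name2"),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  # the reference nucleotide gets whatever frequency the alternative does not
  mutate(ref_freq = 1 - mean_freq) %>%
  # stack the two nucleotide columns: one row per (position, sample, nucleotide)
  pivot_longer(c(reference, alternative),
               names_to = "allele", values_to = "nt") %>%
  mutate(
    freq = if_else(allele == "reference", ref_freq, mean_freq),
    # fixed factor levels so all four columns exist even if one never occurs
    nt   = factor(nt, levels = c("G", "C", "T", "A"))
  ) %>%
  select(position, sample_id, nt, freq) %>%
  # spread nucleotides into columns; absent nucleotides become 0
  pivot_wider(names_from = nt, values_from = freq,
              values_fill = 0, names_expand = TRUE) %>%
  select(position, all_of(c("G", "C", "T", "A")), sample_id)

wide
```

The matrix-indexing approach is likely faster on large inputs; the pipeline version mainly trades speed for readability.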

Tags: dataframe, r, dplyr, tidyverse

Question by eastafri; 1 answer, by akrun.
