
Transform a dataframe of frequencies to a wider format

I have a dataframe that looks like this.

input dataframe

position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2

These data are nucleotide differences at given positions in a hypothetical genome. mean_freq is the frequency of the alternative base relative to the reference, so the first row means the proportion of C's is 0.002, implying that A is at 0.998.

I want to reshape this into a wider structure with one column per nucleotide, such that:
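For reproducibility, the input above can be rebuilt directly in R. This is only a minimal sketch: the name df1 is an assumption (chosen to match the name used in the answer further down), and ref_freq is a hypothetical variable added purely to illustrate the 1 - mean_freq relationship described above.

```r
# Rebuild the example input; `df1` is an assumed name for illustration.
df1 <- data.frame(
  position    = c(1, 2, 3),
  mean_freq   = c(0.002, 0.04, 0.03),
  reference   = c("A", "G", "A"),
  alternative = c("C", "T", "C"),
  sample_id   = c("name1", "name1", "name2"),
  stringsAsFactors = FALSE
)

# mean_freq is the frequency of the alternative base,
# so the reference base takes the remainder.
ref_freq <- 1 - df1$mean_freq
ref_freq
```

With the three example rows, the reference frequencies come out as 0.998, 0.96 and 0.97.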

desired_output

position,G,C,T,A,sample_id
1,0,0.002,0,0.998,name1
2,0.96,0,0.04,0,name1
3,0,0.03,0,0.97,name2

I have attempted this approach

per_position_full_nt_freq <- function(x){
  df <- data.frame(A=0, C=0, G=0, T=0)
  idx <- names(df) %in% x$alternative
  df[,idx] <- x$mean_freq
  idx2 <- names(df) %in% x$reference 
  df[,idx2] <- 1 - x$mean_freq
  df$position <- x$position
  df$sampleName <- x$sampleName
  return(df)
}

desired_output_dataframe <- per_position_full_nt_freq(input_dataframe)

I ran into an error

In matrix(value, n, p) :
  data length [8905] is not a sub-multiple or multiple of the number of columns 

Additionally, I feel there has to be a more intuitive solution, presumably using tidyr or dplyr. How can I conveniently transform the input dataframe into the desired output format?

Thank you.

eastafri asked Nov 11 '17 08:11



1 Answer

One option is to create a matrix of 0's with the column names 'G', 'C', 'T', 'A', match the 'alternative' and 'reference' values against those column names, use the resulting row/column index to assign the frequencies, and then cbind the matrix with the 'position' and 'sample_id' columns of the original dataset.

# Matrix of zeros, one column per nucleotide
m1 <- matrix(0, ncol=4, nrow=nrow(df1), dimnames = list(NULL, c("G", "C", "T", "A")))
# The alternative base gets mean_freq
m1[cbind(seq_len(nrow(df1)), match(df1$alternative, colnames(m1)))]  <-  df1$mean_freq
# The reference base gets the remainder, 1 - mean_freq
m1[cbind(seq_len(nrow(df1)), match(df1$reference, colnames(m1)))]  <-  1 - df1$mean_freq
cbind(df1['position'], m1, df1['sample_id'])
#   position    G     C    T     A sample_id
#1        1 0.00 0.002 0.00 0.998     name1
#2        2 0.96 0.000 0.04 0.000     name1
#3        3 0.00 0.030 0.00 0.970     name2
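The assignment above relies on indexing a matrix with a two-column matrix of (row, column) pairs. To see that mechanism in isolation, here is a small sketch on a toy matrix; the names toy and idx are illustrative only and do not come from the answer.

```r
# Toy 2 x 3 matrix of zeros; `toy` and `idx` are illustrative names.
toy <- matrix(0, nrow = 2, ncol = 3, dimnames = list(NULL, c("a", "b", "c")))

# Each row of `idx` is one (row, column) pair to assign to.
# match() turns column names into column numbers.
idx <- cbind(c(1, 2), match(c("c", "a"), colnames(toy)))

# Equivalent to toy[1, "c"] <- 10 and toy[2, "a"] <- 20, done in one step.
toy[idx] <- c(10, 20)
toy
```

Because each row of the index matrix addresses exactly one cell, a whole vector of values can be assigned at once without looping over rows.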
akrun answered Oct 28 '22 06:10
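Since the question also asks about tidyr or dplyr, here is one possible sketch along those lines. It is not taken from the answer above. It assumes the input is stored as df1, that each (position, sample_id) pair occurs only once, and that the installed tidyr accepts a single value for values_fill. The intermediate names long and wide are hypothetical.

```r
library(dplyr)
library(tidyr)

# Example input; `df1` is an assumed name.
df1 <- data.frame(
  position    = c(1, 2, 3),
  mean_freq   = c(0.002, 0.04, 0.03),
  reference   = c("A", "G", "A"),
  alternative = c("C", "T", "C"),
  sample_id   = c("name1", "name1", "name2"),
  stringsAsFactors = FALSE
)

# Two rows per input row: one for the alternative base,
# one for the reference base with the remaining frequency.
long <- bind_rows(
  df1 %>% transmute(position, sample_id, base = alternative, freq = mean_freq),
  df1 %>% transmute(position, sample_id, base = reference, freq = 1 - mean_freq)
)

# Spread the bases into columns; bases absent for a position become 0.
# select() fixes the column order and assumes all four bases occur
# somewhere in the data (true for this example).
wide <- long %>%
  pivot_wider(names_from = base, values_from = freq, values_fill = 0) %>%
  select(position, G, C, T, A, sample_id)

wide
```

If some nucleotide never appears as a reference or alternative base in the data, its column will not be created by pivot_wider() and the select() call would need adjusting.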