I have a dataframe that looks like this.
input dataframe
position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2
These data are nucleotide differences at a given position in a hypothetical genome. mean_freq is relative to the reference, so the first row means the proportion of C's is 0.002, implying the A's are at 0.998.
I want to transform this to a different structure by creating new columns such that,
desired_output
position,G,C,T,A,sample_id
1,0,0.002,0,0.998,name1
2,0.96,0,0.04,0,name1
3,0,0.03,0,0.97,name2
I have attempted this approach
per_position_full_nt_freq <- function(x){
df <- data.frame(A=0, C=0, G=0, T=0)
idx <- names(df) %in% x$alternative
df[,idx] <- x$mean_freq
idx2 <- names(df) %in% x$reference
df[,idx2] <- 1 - x$mean_freq
df$position <- x$position
df$sampleName <- x$sampleName
return(df)
}
desired_output_dataframe <- per_position_full_nt_freq(input_dataframe)
I ran into an error
In matrix(value, n, p) :
data length [8905] is not a sub-multiple or multiple of the number of columns
Additionally, I feel there has to be a more intuitive solution, presumably using tidyr or dplyr.
How do I conveniently transform the input dataframe to the desired output dataframe format?
Thank you.
One option would be to create a matrix of 0's with the 'G', 'C', 'T', 'A' column names, match the 'alternative' and 'reference' values against those column names, use the resulting row/column index to assign the values, and then cbind the matrix with the original dataset's 'position' and 'sample_id' columns.
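# df1 is the input data frame from the question
# start with an all-zero matrix, one column per nucleotide
m1 <- matrix(0, ncol=4, nrow=nrow(df1), dimnames = list(NULL, c("G", "C", "T", "A")))
# a two-column (row, column) index matrix assigns one cell per row
m1[cbind(seq_len(nrow(df1)), match(df1$alternative, colnames(m1)))] <- df1$mean_freq
# the reference nucleotide gets the remaining proportion
m1[cbind(seq_len(nrow(df1)), match(df1$reference, colnames(m1)))] <- 1 - df1$mean_freq
cbind(df1['position'], m1, df1['sample_id'])
#  position    G     C    T     A sample_id
#1        1 0.00 0.002 0.00 0.998     name1
#2        2 0.96 0.000 0.04 0.000     name1
#3        3 0.00 0.030 0.00 0.970     name2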
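Since the question asks about tidyr or dplyr, here is a rough sketch of the same reshaping with pivot_longer() and pivot_wider(). It is not part of the matrix approach above, and it rests on a few assumptions: df1 is rebuilt by hand from the three example rows, the helper column names allele, nt and freq are made up for illustration, and the names_expand argument needs tidyr 1.2.0 or later.

```r
library(dplyr)
library(tidyr)

# Example input rebuilt from the question (assumed; the real data has more rows)
df1 <- data.frame(
  position    = 1:3,
  mean_freq   = c(0.002, 0.04, 0.03),
  reference   = c("A", "G", "A"),
  alternative = c("C", "T", "C"),
  sample_id   = c("name1", "name1", "name2"),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  # stack reference and alternative: two rows per original row
  pivot_longer(c(reference, alternative),
               names_to = "allele", values_to = "nt") %>%
  # the alternative carries mean_freq, the reference gets the remainder
  mutate(
    freq = if_else(allele == "alternative", mean_freq, 1 - mean_freq),
    # a factor lets names_expand create a column for unseen nucleotides too
    nt   = factor(nt, levels = c("G", "C", "T", "A"))
  ) %>%
  select(position, sample_id, nt, freq) %>%
  # spread nucleotides back out into columns, filling absent ones with 0
  pivot_wider(names_from = nt, values_from = freq,
              values_fill = 0, names_expand = TRUE) %>%
  # put the columns in the order of the desired output
  select(position, all_of(c("G", "C", "T", "A")), sample_id)

wide
```

This sketch assumes each position/sample_id pair occurs only once in the input. If a pair has several alternatives, pivot_wider() will see duplicate cells for the reference nucleotide and the rows would need to be summarised first.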