Combining 2 columns into 1 column many times in a very large dataset in R

Tags: merge, r

The clumsy solutions I am working on are not going to be very fast, even if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for step 1) below at this point, although I have some code for steps 2) and 3).

Here is a toy example of the data structure:

n = 10  # number of individuals; must match the length of the SNP vectors below
pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
age = round(rnorm(n, mean=40, 10)), disType = rbinom(n, 1, .2),
rs123=c(1,3,1,3,3,1,1,1,3,1), rs123.1=rep(1, n), rs157=c(2,4,2,2,2,4,4,4,2,2),
rs157.1=c(4,4,4,2,4,4,4,4,2,2),  rs132=c(4,4,4,4,4,4,4,4,2,2),
rs132.1=c(4,4,4,4,4,4,4,4,4,4))

Thus, there are a few columns of basic demographic info, and the rest of the columns are biallelic SNP info. For example, rs123 is the first allele of rs123 and rs123.1 is the second allele of rs123.

1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):

11
31
11
31
31
11
11
11
31
11

2) I need to identify the least frequent SNP value (in the above example it is 31).

3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.

S.R. asked Feb 28 '10 00:02



1 Answer

Do you mean 'merge', 'rearrange', or simply concatenate? If it is the last, then

R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""), 
+                                rs157=paste(pop[,7],pop[,8],sep=""), 
+                                rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
   status sex age disType rs123 rs157 rs132
1       0   0  42       0    11    24    44
2       1   1  37       0    31    44    44
3       1   0  38       0    11    24    44
4       0   1  45       0    31    22    44
5       1   1  25       0    31    24    44
6       0   1  31       0    11    44    44
7       1   0  43       0    11    44    44
8       0   0  41       0    11    44    44
9       1   1  57       0    31    22    24
10      1   1  40       0    11    22    24

and now you can do counts and whatnot on pop2:

R> sapply(pop2[,5:7], table)
$rs123

11 31 
 6  4 

$rs157

22 24 44 
 3  3  4 

$rs132

24 44 
 2  8 
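
The column indices above are written out by hand, which will not scale to roughly 45000 SNP columns. A minimal sketch of one way to generalize it, assuming every SNP follows the question's naming pattern (first allele in `<name>`, second allele in `<name>.1`) and that no demographic column ends in `.1`. The toy data is cut down to one demographic column with fixed values, and `first`, `second`, `combined`, and `demo` are illustrative names only:

```r
# Toy data with the same layout as the question: allele columns come in
# pairs named <snp> and <snp>.1 (assumed to hold for every SNP).
pop <- data.frame(
  status  = c(0, 1, 1, 0, 1, 0, 1, 0, 1, 1),
  rs123   = c(1, 3, 1, 3, 3, 1, 1, 1, 3, 1),
  rs123.1 = rep(1, 10),
  rs157   = c(2, 4, 2, 2, 2, 4, 4, 4, 2, 2),
  rs157.1 = c(4, 4, 4, 2, 4, 4, 4, 4, 2, 2),
  rs132   = c(4, 4, 4, 4, 4, 4, 4, 4, 2, 2),
  rs132.1 = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4)
)

# Second-allele columns end in ".1"; stripping the suffix gives the
# matching first-allele column name.
second <- grep("\\.1$", names(pop), value = TRUE)
first  <- sub("\\.1$", "", second)
stopifnot(all(first %in% names(pop)))  # every pair must be complete

# paste0() is vectorised over rows, so the only loop is over SNPs.
combined <- Map(function(a, b) paste0(pop[[a]], pop[[b]]), first, second)
names(combined) <- first

# Keep the non-SNP columns and attach the combined genotype columns.
demo <- pop[, setdiff(names(pop), c(first, second)), drop = FALSE]
pop2 <- cbind(demo, as.data.frame(combined, stringsAsFactors = FALSE))

sapply(pop2[first], table)
```

Looking columns up by name rather than by position avoids depending on the two alleles of a SNP sitting next to each other in the data frame.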

Dirk Eddelbuettel answered Sep 29 '22 09:09
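
Steps 2) and 3) of the question (find the least frequent combined value in each SNP column, then recode it as 1 and everything else as 0) are not covered by the answer. A minimal sketch of one way to do it, assuming the genotype columns are already combined into character strings; `recode_rarest` and `geno` are hypothetical names introduced here for illustration. Ties for least frequent are broken by taking the first value in sorted order, and a column with a single distinct value is recoded to all 1s, which may or may not be what you want:

```r
# Hypothetical helper: 1 where the value is the least frequent one in the
# column, 0 otherwise. Missing values stay NA because table() ignores them
# and the comparison with NA yields NA.
recode_rarest <- function(x) {
  counts <- table(x)
  rarest <- names(counts)[which.min(counts)]  # ties: first in sorted order
  as.integer(x == rarest)
}

# Combined genotype columns taken from the toy example above.
geno <- data.frame(
  rs123 = c("11", "31", "11", "31", "31", "11", "11", "11", "31", "11"),
  rs132 = c("44", "44", "44", "44", "44", "44", "44", "44", "24", "24"),
  stringsAsFactors = FALSE
)

# Apply the recoding to every column while keeping the data frame shape.
geno[] <- lapply(geno, recode_rarest)
geno
```

For a data frame that also holds demographic columns, the same `lapply()` call can be restricted to the SNP columns only, for example `pop2[snps] <- lapply(pop2[snps], recode_rarest)` with `snps` being a character vector of SNP column names.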