Subtracting every two columns

Tags: regex, r

Imagine I have a data frame like this (the column prefixes could just as well be the names of all months):

set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(20),3)))
mydata <- rbind(mydata,c(2,round(runif(20),3)))
mydata <- rbind(mydata,c(3,round(runif(20),3)))
colnames(mydata) <- c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2))   


id Mary1 Mary2  Bob1  Bob2 Dylan1 Dylan2  Tom1  Tom2 Jane1 Jane2  Sam1  Sam2 Tony1 Tony2 Luke1 Luke2 John1 John2  Pam1  Pam2
1  0.266 0.372 0.573 0.908  0.202  0.898 0.945 0.661 0.629 0.062 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777
2  0.935 0.212 0.652 0.126  0.267  0.386 0.013 0.382 0.870 0.340 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411
3  0.821 0.647 0.783 0.553  0.530  0.789 0.023 0.477 0.732 0.693 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407

Usually with many more columns.

I want to add columns (either appended on the right or in a new data frame, whichever you prefer) obtained by subtracting each pair of columns (*):

id, Mary1-Mary2,  Bob1-Bob2,  Dylan1-Dylan2,  Tom1-Tom2,  Jane1-Jane2,...

This operation is quite common.

I'd like to do it by name, not by position, to avoid problems if the columns are not consecutive. Some columns might not even have a "twin" column; those can be left as is, or you can ignore this complication for now.

(*) The column names consist of a prefix and a number. Instead of just subtracting two columns, I could have groups of 5 and might want to do something else, such as summing all of them. A generic solution would be great.

I first tried converting the data to long format, operating on it with aggregate, and converting it back to wide format, but maybe it's much easier to do it directly in wide format. I know the problem is mainly about using regular expressions efficiently.
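For reference, the long-format round trip described above might look roughly like the following in base R. This is only a sketch, not the code originally tried: the shortened `prefixes` vector and the intermediate names `long`, `agg` and `wide` are assumptions made for illustration, and it assumes every prefix has exactly the columns 1 and 2.

```r
# Small version of the example data (three prefixes instead of ten).
set.seed(1)
prefixes <- c("Mary", "Bob", "Dylan")
mydata <- data.frame(id = 1:3,
                     matrix(round(runif(18), 3), nrow = 3, byrow = TRUE))
colnames(mydata) <- c("id", paste0(rep(prefixes, each = 2), 1:2))

# Wide -> long: one row per (id, column), with the column name split
# into its prefix and its numeric suffix by regular expressions.
vals <- mydata[, -1]
long <- data.frame(
  id     = rep(mydata$id, times = ncol(vals)),
  prefix = rep(sub("\\d+$", "", names(vals)), each = nrow(vals)),
  suffix = rep(as.integer(sub("^\\D+", "", names(vals))), each = nrow(vals)),
  value  = unlist(vals, use.names = FALSE)
)

# Sort by suffix so that, within each group, v[1] is column 1 and v[2] is column 2.
long <- long[order(long$suffix), ]
agg <- aggregate(value ~ id + prefix, data = long,
                 FUN = function(v) v[1] - v[2])

# Long -> wide: one "value.<prefix>" column per prefix.
wide <- reshape(agg, idvar = "id", timevar = "prefix", direction = "wide")
wide <- wide[order(wide$id), ]
```

Swapping the `FUN` argument (for example to `sum`) changes the per-group operation without touching the reshaping steps.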

R, data.table or dplyr, long format splitting colnames

I don't care about speed; I want the simplest solution. Any package is welcome.

PS: All of your solutions fail if I add a lone column:

set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(21),3)))
mydata <- rbind(mydata,c(2,round(runif(21),3)))
mydata <- rbind(mydata,c(3,round(runif(21),3)))
colnames(mydata) <- c(c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2)),"Lola" )

I know I could filter it out manually, but it would be better if the result were the difference (*) of every pair, leaving the lone column alone (in the case of differences over groups of size two).

The best option would be not to remove the first column manually, but to split all columns into single columns and grouped columns.

skan asked May 04 '16 09:05


1 Answer

How about using base R:

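# Column prefixes: strip the digits, de-duplicate, and drop "id" (the first entry)
cn <- unique(gsub("\\d", "", colnames(mydata)))[-1]
# For each prefix, subtract the "2" column from the "1" column
sapply(cn, function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]] )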

You can use this approach for groups of any size. For example, this returns the row sums across the columns with suffix 1 or 2:

sapply(cn, function(x) rowSums(mydata[, paste0(x, 1:2)]))

This paste approach could be replaced by regular expressions for more general applications.
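One way such a regex-based variant might look is sketched below. It groups columns by the prefix captured from their names, orders each group by its numeric suffix, and leaves any column without a twin (such as `id` or `Lola`) untouched. The helper name `combine_groups`, its `fun` and `pattern` arguments, and the `matrix()`-based reconstruction of the example data are all assumptions for illustration, not part of the answer above.

```r
# Rebuild the question's second example, including the lone "Lola" column.
set.seed(1)
prefixes <- c("Mary", "Bob", "Dylan", "Tom", "Jane",
              "Sam", "Tony", "Luke", "John", "Pam")
mydata <- data.frame(id = 1:3,
                     matrix(round(runif(63), 3), nrow = 3, byrow = TRUE))
colnames(mydata) <- c("id", paste0(rep(prefixes, each = 2), 1:2), "Lola")

# Hypothetical helper: apply `fun` row-wise to every group of columns named
# <prefix><number>; columns with no twin are returned unchanged.
combine_groups <- function(df, fun = function(m) m[, 1] - m[, 2],
                           pattern = "^(.*\\D)(\\d+)$") {
  nm <- colnames(df)
  numbered <- grepl(pattern, nm)
  prefix <- rep(NA_character_, length(nm))
  suffix <- rep(NA_integer_, length(nm))
  prefix[numbered] <- sub(pattern, "\\1", nm[numbered])
  suffix[numbered] <- as.integer(sub(pattern, "\\2", nm[numbered]))

  # A prefix forms a group only if at least two columns share it.
  grouped <- numbered & prefix %in% names(which(table(prefix) > 1))
  groups <- unique(prefix[grouped])
  if (length(groups) == 0) return(df)

  res <- lapply(groups, function(p) {
    idx <- which(grouped & prefix == p)
    idx <- idx[order(suffix[idx])]  # Mary1 before Mary2, wherever they sit
    unname(fun(as.matrix(df[, idx, drop = FALSE])))
  })
  names(res) <- groups

  # Lone columns first, then one result column per prefix.
  cbind(df[, !grouped, drop = FALSE], data.frame(res, check.names = FALSE))
}

diffs <- combine_groups(mydata)                # Mary = Mary1 - Mary2, ...
sums  <- combine_groups(mydata, fun = rowSums) # works for any group size
```

Because columns are matched by name and sorted by suffix, the twins do not need to be adjacent, and passing a different `fun` covers larger groups (for example, summing five columns per prefix).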

Raad answered Sep 24 '22 22:09