Combine column values in an R dataframe all at once

Tags: r

Is there a way to combine one column of an R data frame with several other columns all at once?

For example,

asd <- data.frame(a = c("A","B"), b = c("d","f"), c = c("x","y"))
asd
  a b c
1 A d x
2 B f y

Expected output (combine column 'a' with both column b and c):

  a  b  c
1 A Ad Ax
2 B Bf By
user11740857 asked Jun 30 '21 07:06


4 Answers

You can use paste0 on the first column, asd[[1]], and the unlisted remaining columns, unlist(asd[-1]), and assign the result back to those columns, asd[-1].

asd[-1] <- paste0(asd[[1]], unlist(asd[-1]))
#  a  b  c
#1 A Ad Ax
#2 B Bf By
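
To see why the one-liner above works, it can help to look at the intermediate values. The following is a minimal sketch on the question's data; the variable names flat and combined are only illustrative, and stringsAsFactors = FALSE is set so that it behaves the same on R versions before 4.0.

```r
asd <- data.frame(a = c("A", "B"), b = c("d", "f"), c = c("x", "y"),
                  stringsAsFactors = FALSE)

# unlist() flattens the remaining columns one column after the other
flat <- unlist(asd[-1], use.names = FALSE)
flat
# "d" "f" "x" "y"

# paste0() recycles the shorter vector (column a) along the longer one
combined <- paste0(asd[[1]], flat)
combined
# "Ad" "Bf" "Ax" "By"

# the assignment fills the target columns column by column again
asd[-1] <- combined
```

The recycling only lines up because every column has the same number of rows as column a, which is always the case inside a single data frame.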

Disabling recursive and use.names in unlist might improve performance:

asd[-1] <- paste0(asd[[1]], unlist(asd[-1], FALSE, FALSE))

The same but using names:

S <- c("b", "c")
asd[S] <- paste0(asd$a, unlist(asd[S]))

Another way is to use paste0 in Map, subsetting asd with [-1] to exclude the first column and with [rep(1,2)] to get the first column twice.

asd[-1] <- Map(paste0, asd[rep(1,2)], asd[-1])

The same but using names:

S <- c("b", "c")
asd[S] <- Map(paste0, asd[rep("a", length(S))], asd[S])
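
The name-based Map variant could be wrapped in a small function so that the key column and the target columns become arguments. This is only a sketch: prefix_cols and its arguments key and targets are hypothetical names, not part of any package. It wraps the key column in a one-element list and relies on Map recycling that list, instead of repeating the column with rep.

```r
# prefix_cols() is a hypothetical helper, not an existing function
prefix_cols <- function(df, key, targets = setdiff(names(df), key)) {
  # list(df[[key]]) has length 1, so Map recycles it for every target column
  df[targets] <- Map(paste0, list(df[[key]]), df[targets])
  df
}

asd <- data.frame(a = c("A", "B"), b = c("d", "f"), c = c("x", "y"),
                  stringsAsFactors = FALSE)
res <- prefix_cols(asd, "a")
res
#   a  b  c
# 1 A Ad Ax
# 2 B Bf By
```

Calling prefix_cols(asd, "a", "c") would change only column c and leave b untouched.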

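Another way is to use a for loop, indexing the columns either by position or by name. Run only one of the two loops, since each modifies asd in place:

for(i in 2:3) {asd[[i]] <- paste0(asd[[1]], asd[[i]])}

for(i in c("b", "c")) {asd[[i]] <- paste0(asd$a, asd[[i]])}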

Comparing the methods:

getDf <- function(nr, nc) { #function to create example dataset
    data.frame(a = sample(LETTERS, nr, TRUE),
               setNames(replicate(nc, sample(letters, nr, TRUE), simplify=FALSE), paste0("b", seq_len(nc))))
}

library(dplyr)
library(stringr)
library(purrr)
M <- alist(
    unlist = (function(asd) {asd[,-1] <- paste0(asd[,1], unlist(asd[,-1], FALSE, FALSE)); asd})(D)
  , Map = (function(asd) {asd[-1] <- Map(paste0, asd[rep(1,ncol(asd)-1)], asd[-1]); asd})(D)
  , "for" = (function(asd) {for(i in 2:ncol(asd)) {asd[[i]] <- paste0(asd[,1], asd[,i])}; asd})(D)
  , "for+str_c" = (function(asd) {for(i in 2:ncol(asd)) {asd[[i]] <- str_c(asd[,1], asd[,i])}; asd})(D)
  , lapply = (function(asd) {asd[-1] <- lapply(asd[-1], function(x) paste0(asd$a, x)); asd})(D)
  , across = (function(asd) {asd <- asd %>% mutate(across(-a, ~str_c(a, .x))); asd})(D)
  , pmap = (function(asd) {asd <- asd %>%
  pmap_dfr(~ c(list(...)[1], setNames(paste(..1, c(...)[-1], sep = ""), names(asd)[-1]))); as.data.frame(asd)})(D)
  , "row+matrix" = (function(asd) {asd[-1] <- paste0(asd$a[row(asd[-1])], as.matrix(asd[-1])); asd})(D)
  , apply = (function(asd) {asd[-1] <- apply(asd[-1], 2, function(x) paste0(asd[[1]], x)); asd})(D)
)
D <- getDf(1e5,2) #1e5 rows and 2 columns
bench::mark(exprs = M)
#  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 unlist     29.07ms 29.92ms    29.5     12.68MB    11.8     15     6      509ms
#2 Map        22.94ms 23.02ms    42.6      1.53MB     1.94    22     1   516.38ms
#3 for        22.84ms 22.96ms    42.8      1.53MB     1.94    22     1   514.15ms
#4 for+str_c   9.78ms    10ms    97.2      1.53MB     3.97    49     2   503.89ms
#5 lapply     22.89ms 23.01ms    42.7      1.53MB     1.94    22     1   514.82ms
#6 across     12.29ms 12.57ms    77.8      1.53MB     1.99    39     1   501.43ms
#7 pmap         2.95s   2.95s     0.339    9.54MB     6.45     1    19      2.95s
#8 row+matrix 30.64ms 32.65ms    19.8     14.97MB     6.09    13     4   656.35ms
#9 apply      32.93ms 34.12ms    27.7     19.55MB     5.94    14     3   504.85ms
#Warning message:
#Some expressions had a GC in every iteration; so filtering is disabled. 
D <- getDf(1e2, 1e3) #1e2 rows and 1e3 columns
bench::mark(exprs = M)
#  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 unlist      21.4ms  21.7ms     45.2    18.08MB     9.68    14     3      310ms
#2 Map           28ms  28.1ms     35.3    12.53MB     4.41    16     2      453ms
#3 for         39.3ms  39.4ms     25.4      8.5MB     2.11    12     1      473ms
#4 for+str_c   34.1ms  34.3ms     29.1      8.5MB     4.48    13     2      447ms
#5 lapply      21.9ms  22.1ms     44.7    12.48MB     7.46    18     3      402ms
#6 across      80.3ms  80.9ms     12.3     5.98MB     4.93     5     2      406ms
#7 pmap       113.9ms   114ms      8.74    17.5MB     5.83     3     2      343ms
#8 row+matrix  24.5ms  24.6ms     40.2    19.31MB    10.7     15     4      373ms
#9 apply       32.3ms  32.5ms     30.5    21.72MB    11.1     11     4      360ms

Regarding memory usage, across and the for loop can be recommended. Regarding speed, Map, for and lapply are the fastest paste0-based variants with two columns, and unlist and lapply are the fastest with 1000 columns, so overall lapply can be recommended. Using str_c instead of paste0 can also improve performance.


If all columns have the same type, consider storing the data in a matrix, which shows advantages when there are many columns.

M <- as.matrix(asd)

M[,-1] <- paste0(M[,1], M[,-1])

M
#     a   b    c   
#[1,] "A" "Ad" "Ax"
#[2,] "B" "Bf" "By"
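
If a data frame is needed again afterwards, the matrix can be converted back. The sketch below assumes all columns are character, as in the question; the names res and mixed are only illustrative. With mixed column types, as.matrix coerces everything to character, so the round trip would not restore numeric columns.

```r
asd <- data.frame(a = c("A", "B"), b = c("d", "f"), c = c("x", "y"),
                  stringsAsFactors = FALSE)

M <- as.matrix(asd)
M[, -1] <- paste0(M[, 1], M[, -1])

# convert back; stringsAsFactors = FALSE keeps character columns on R < 4.0
res <- as.data.frame(M, stringsAsFactors = FALSE)

# caveat: a mixed-type data frame becomes an all-character matrix
mixed <- as.matrix(data.frame(x = 1:2, y = c("a", "b")))
is.character(mixed)
# TRUE
```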
D <- getDf(1e5,2)
M <- as.matrix(D)
bench::mark(check = FALSE #One gives a data frame, the other a matrix
 , lapply = (function(asd) {asd[-1] <- lapply(asd[-1], function(x) paste0(asd$a, x))})(D)
 , lapplyStr_C = (function(asd) {asd[-1] <- lapply(asd[-1], function(x) stringr::str_c(asd$a, x))})(D)
 , matrix = (function(M) {M[,-1] <- paste0(M[,1], M[,-1])})(M)
 , matrixStr_C = (function(M) {M[,-1] <- stringr::str_c(M[,1], M[,-1])})(M)
)
#  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr>  <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 lapply       28.3ms 28.8ms      34.7    1.53MB     0       18     0      519ms
#2 lapplyStr_C  13.6ms 13.9ms      71.6    1.53MB     2.05    35     1      489ms
#3 matrix       34.1ms 34.4ms      28.9    7.25MB     7.24    12     3      415ms
#4 matrixStr_C  17.8ms 18.2ms      53.9    7.25MB     7.35    22     3      408ms
D <- getDf(1e2, 1e3)
M <- as.matrix(D)
bench::mark(check = FALSE #One gives a data frame, the other a matrix
 , lapply = (function(asd) {asd[-1] <- lapply(asd[-1], function(x) paste0(asd$a, x))})(D)
 , lapplyStr_C = (function(asd) {asd[-1] <- lapply(asd[-1], function(x) stringr::str_c(asd$a, x))})(D)
 , matrix = (function(M) {M[,-1] <- paste0(M[,1], M[,-1])})(M)
 , matrixStr_C = (function(M) {M[,-1] <- stringr::str_c(M[,1], M[,-1])})(M)
)
#  expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 lapply       32.41ms  32.66ms      30.5   12.48MB    15.2     10     5
#2 lapplyStr_C  26.85ms  27.11ms      36.9   12.48MB    18.4     12     6
#3 matrix       16.28ms  16.94ms      59.4    2.32MB     2.05    29     1
#4 matrixStr_C   7.51ms   7.77ms     127.     2.32MB     6.90    55     3
GKi answered Nov 16 '22 12:11


You can use lapply in base R:

asd[-1] <- lapply(asd[-1], function(x) paste0(asd$a, x))

Or across in dplyr:

library(dplyr)
library(stringr)

asd %>% mutate(across(-a, ~str_c(a, .x)))

#  a  b  c
#1 A Ad Ax
#2 B Bf By
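
One difference between the two variants that may matter on real data is how missing values are handled: paste0 turns NA into the string "NA", whereas stringr's str_c is documented to return NA when any input is missing. The sketch below uses base R only; the NA in column b is an assumption added for illustration, and the ifelse wrapper is one possible way to keep missing values missing, not the answer's method.

```r
asd <- data.frame(a = c("A", "B"), b = c("d", NA), c = c("x", "y"),
                  stringsAsFactors = FALSE)

# paste0() converts NA to the literal string "NA"
pasted <- asd
pasted[-1] <- lapply(asd[-1], function(x) paste0(asd$a, x))
pasted$b
# "Ad"  "BNA"

# keep missing values missing by checking for NA first
kept <- asd
kept[-1] <- lapply(asd[-1], function(x) {
  ifelse(is.na(x), NA_character_, paste0(asd$a, x))
})
kept$b
# "Ad" NA
```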
Ronak Shah answered Nov 16 '22 12:11


We can also use the pmap function from purrr:

library(purrr)

asd %>%
  pmap_dfr(~ c(list(...)[1], setNames(paste(..1, c(...)[-1], sep = ""), names(asd)[-1])))

# A tibble: 2 x 3
#   a     b     c
#   <chr> <chr> <chr>
# 1 A     Ad    Ax
# 2 B     Bf    By
Anoushiravan R answered Nov 16 '22 11:11


You can try

asd[-1] <- paste0(asd$a[row(asd[-1])], as.matrix(asd[-1]))

which gives

> asd
  a  b  c
1 A Ad Ax
2 B Bf By
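
The role of row() here may not be obvious, so the following sketch prints the intermediate values on the question's data. The names idx and keys are only illustrative.

```r
asd <- data.frame(a = c("A", "B"), b = c("d", "f"), c = c("x", "y"),
                  stringsAsFactors = FALSE)

# row() returns a matrix holding the row number of every cell
idx <- row(asd[-1])
idx
#      [,1] [,2]
# [1,]    1    1
# [2,]    2    2

# indexing column a with it repeats a once per target column
keys <- asd$a[idx]
keys
# "A" "B" "A" "B"

# the repeated keys line up with the cells of the target columns
asd[-1] <- paste0(keys, as.matrix(asd[-1]))
```

Since paste0 would recycle asd$a anyway, the explicit row() index mainly makes the alignment between keys and cells visible.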
ThomasIsCoding answered Nov 16 '22 12:11