Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Concatenate (paste) elements based on indices

Tags:

r

I want to concatenate (paste) rows of an elements in a data.frame based on indices. How can I do this efficiently? This is easy with a unique identifier but here I have indices not a factor to concatenate on. All package and base solution options welcomed.

indexes <- list(
    2:3,
    6:8,
    11:12
)


dat <- data.frame(
    x = c(1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 2, 2, 1),
    y = LETTERS[1:13],
    z = "PP", stringsAsFactors = FALSE
)

   x y  z
1  1 A PP
2  2 B PP
3  2 C PP
4  3 D PP
5  4 E PP
6  5 F PP
7  5 G PP
8  5 H PP
9  6 I PP
10 7 J PP
11 2 K PP
12 2 L PP
13 1 M PP

Desired output:

   x y         z
1  1 A        PP
2  2 B, C     PP
3  3 D        PP
4  4 E        PP
5  5 F, G, H  PP
6  6 I        PP
7  7 J        PP
8  2 K, L     PP
9  1 M        PP
like image 258
Tyler Rinker Avatar asked Oct 08 '15 17:10

Tyler Rinker


2 Answers

Here's one possible solution in base R:

dat[sapply(indexes, "[", 1), "y"] <- sapply(indexes, function(i)  {
    paste(dat[["y"]][i], collapse = ", ")
})

dat[  setdiff(  1:nrow(dat), 
                        setdiff(
                          unlist(indexes),
                          sapply(indexes, "[", 1)
                          )
                        ), ]

   x       y  z
1  1       A PP
2  2    B, C PP
4  3       D PP
5  4       E PP
6  5 F, G, H PP
9  6       I PP
10 7       J PP
11 2    K, L PP
13 1       M PP
like image 71
data_steve Avatar answered Sep 19 '22 06:09

data_steve


Another base R method:

indx <- !(1:nrow(dat) %in% unlist(lapply(indexes, '[', -1)))
transform(dat, y=ave(y, cumsum(indx), FUN=toString))[indx,]
#    x       y  z
# 1  1       A PP
# 2  2    B, C PP
# 4  3       D PP
# 5  4       E PP
# 6  5 F, G, H PP
# 9  6       I PP
# 10 7       J PP
# 11 2    K, L PP
# 13 1       M PP

Explanation

Some insight into how !(1:nrow(dat) %in% unlist(lapply(indexes, '[', -1))) came about:

I tried to find an index to group by. I started from the end and worked backwards. I knew that if I could get:

1 2 2 3 4 5 5 5 6 7 8 8 9

I could use ave and run toString. I thought what true and false combination would be needed to make cumsum give the above index as output. I wrote that out:

cumsum(c(T, T, F, T, T, T, F, F, T, T, T, F, T))
 [1] 1 2 2 3 4 5 5 5 6 7 8 8 9

I needed to find a way to create that logical index. If all of the indexes list elements that aren't the first were false, I'd have the logical index I needed.

unlist(lapply(indexes, '[', -1))
[1]  3  7  8 12

You'll notice that those positions are all of the false values in the index.

dplyr

I guess it's only fair to add dplyr to the mix:

dat %>% 
  mutate(indx = na.omit(c(T, x != lead(x)))) %>% 
  group_by(ind2=cumsum(indx)) %>% 
  mutate(y=toString(y)) %>%
  filter(indx)
like image 43
Pierre L Avatar answered Sep 19 '22 06:09

Pierre L