
How to create a new data frame with original data separated by ; and with different counts per category?

Tags: dataframe, r

I have a table with the following format.

df1 <- data.frame (A=c("aaa", "bbb", "ccc", "ddd"),
                   B=c("111; 222", "333", "444; 555; 666; 777", "888; 999"))

    A                  B
1 aaa           111; 222
2 bbb                333
3 ccc 444; 555; 666; 777
4 ddd           888; 999

I want to have a dataframe like this:

aaa 111
aaa 222
bbb 333
ccc 444
ccc 555
ccc 666
ccc 777
ddd 888
ddd 999

I found a wonderful solution for converting a similar list to a data frame in previous Stack Overflow questions. However, I am struggling to adapt it to a data frame with multiple entries. How can I do this?

a83 asked Dec 07 '22 21:12


1 Answer

Here is a simple base R solution (explanation below):

spl <- with(df1, strsplit(as.character(B), split = "; ", fixed = TRUE))
lens <- sapply(spl, length)
out <- with(df1, data.frame(A = rep(A, lens), B = unlist(spl)))

Which gives us:

R> out
    A   B
1 aaa 111
2 aaa 222
3 bbb 333
4 ccc 444
5 ccc 555
6 ccc 666
7 ccc 777
8 ddd 888
9 ddd 999

What is the code doing? Line 1:

spl <- with(df1, strsplit(as.character(B), split = "; ", fixed = TRUE))

breaks apart each of the strings in B, using "; " as the separator. We use fixed = TRUE (as suggested by @Marek in the comments) to speed up the matching and splitting: in this case we do not need a regular expression, we simply want to match the literal string. This gives us a list with the various elements split out:

R> spl
[[1]]
[1] "111" "222"

[[2]]
[1] "333"

[[3]]
[1] "444" "555" "666" "777"

[[4]]
[1] "888" "999"
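
As a side note on fixed = TRUE, here is a minimal sketch of why literal matching can matter beyond speed. The input vector x and its "." separator are made up for illustration and are not part of the question's data:

```r
# Hypothetical input whose separator is a regex metacharacter
x <- c("111.222", "333")

# fixed = TRUE: the "." is matched literally, so we split where we expect
fixed_split <- strsplit(x, split = ".", fixed = TRUE)

# Default (regular expression): "." matches any character, so every
# character is treated as a separator and only empty strings remain
regex_split <- strsplit(x, split = ".")

fixed_split[[1]]
regex_split[[1]]
```

With "; " as the separator both modes give the same result, so in the solution above fixed = TRUE is only about speed.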

The next line simply counts how many elements there are in each component of the list spl:

lens <- sapply(spl, length)

which gives us a vector of lengths:

R> lens
[1] 2 1 4 2
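
If you are on a reasonably recent R (3.2.0 or later, as far as I am aware), base R also has lengths(), which should do the same job as the sapply() call. A small sketch, rebuilding spl by hand so that it runs on its own:

```r
# The list produced by strsplit() above, written out by hand
spl <- list(c("111", "222"),
            "333",
            c("444", "555", "666", "777"),
            c("888", "999"))

# Same counts two ways: sapply() over length(), or the lengths() shortcut
lens_sapply  <- sapply(spl, length)
lens_builtin <- lengths(spl)

identical(lens_sapply, lens_builtin)
```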

The final line of the solution plugs the outputs from the two previous steps into a new data frame. The trick is to repeat each element of df1$A the number of times given by the corresponding element of lens, which is what rep() does when its second argument is a vector of counts. We also need to flatten the list spl into a vector, which we do with unlist():

out <- with(df1, data.frame(A = rep(A, lens), B = unlist(spl)))
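
To see the rep() step in isolation, here is a small sketch using the values from above typed in by hand (the names A and lens simply mirror the ones in the solution):

```r
A    <- c("aaa", "bbb", "ccc", "ddd")
lens <- c(2, 1, 4, 2)

# A vector of counts as the second argument repeats each element of A
# by the matching count, keeping the original order
rep_by_count <- rep(A, lens)

# For contrast, a single number repeats the whole vector instead
rep_whole <- rep(A, 2)

rep_by_count
rep_whole
```

Because rep(A, lens) has the same length and ordering as unlist(spl), the two columns line up row by row in the resulting data frame.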

Gavin Simpson answered Jan 18 '23 03:01