
Extract data elements found in a single column

Here is what my data look like.

id interest_string
1       YI{Z0{ZI{
2             ZO{
3            <NA>
4             ZT{

As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string value at all.

How can I manipulate this data frame to extract the values into a format like this:

id  interest
1    YI
1    Z0
1    ZI
2    ZO
3    <NA>
4    ZT

I need to complete this task with R.

Thanks in advance.

Btibert3 asked Dec 25 '22 16:12

2 Answers

Here is one solution:

out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))

out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
                        interest = unlist(out, use.names = FALSE))

Giving:

R> out
  id interest
1  1       YI
2  1       Z0
3  1       ZI
4  2       ZO
5  3     <NA>
6  4       ZT

Explanation

The first line splits each element of the interest_string factor in the data object dat, using \\{ as the split pattern. Because { is a regular-expression metacharacter, it has to be escaped with a backslash, and the backslash itself must be doubled inside an R string literal, hence \\{. (No escaping is needed if you pass fixed = TRUE to strsplit, which treats the pattern as a literal string.) The result is a list, which looks like this for the example data:

R> out
[[1]]
[1] "YI" "Z0" "ZI"

[[2]]
[1] "ZO"

[[3]]
[1] "<NA>"

[[4]]
[1] "ZT"
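
As a small check of the escaping point, the sketch below compares the escaped-regex form with the fixed = TRUE form. The vector x is a hypothetical stand-in for the interest_string column, and it assumes the missing entry is a real NA rather than the literal string "<NA>".

```r
# Hypothetical input mirroring the example data (assumption: real NA)
x <- c("YI{Z0{ZI{", "ZO{", NA, "ZT{")

# "{" is a regex metacharacter, so escape it in the pattern ...
by_regex <- strsplit(x, "\\{")

# ... or ask strsplit() to treat the pattern as a literal string
by_fixed <- strsplit(x, "{", fixed = TRUE)

# Both forms should produce the same list of pieces
same_result <- identical(by_regex, by_fixed)
```

For a single literal delimiter like this one, fixed = TRUE is arguably the easier form to read, since there is nothing to escape.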

We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.

Hence, in the second line, we bind column-wise (calling the data frame method so that a data frame is returned) the original id values, each repeated the required number of times, to the unlisted strsplit result (out). Unlisting unwraps the list into a single vector of the length required by your expected output. The number of times each id value must be repeated is the length of the corresponding component of the list returned by strsplit.
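
The two steps can also be traced with their intermediate values in a self-contained sketch. The data frame dat below is rebuilt by hand as an assumption about the original data (with a real NA for the empty row), and lengths() is swapped in for sapply(out, length); the names pieces, n_codes and long are only illustrative.

```r
# Hypothetical reconstruction of the example data (assumption: real NA)
dat <- data.frame(
  id = 1:4,
  interest_string = c("YI{Z0{ZI{", "ZO{", NA, "ZT{"),
  stringsAsFactors = FALSE
)

# Step 1: split every string on the literal "{" to get a list of pieces
pieces <- strsplit(dat$interest_string, "{", fixed = TRUE)

# Number of codes per row; an NA row still contributes one (NA) element
n_codes <- lengths(pieces)

# Step 2: repeat each id once per code and flatten the list into a vector
long <- data.frame(
  id = rep(dat$id, times = n_codes),
  interest = unlist(pieces, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Because the NA row yields a list component of length one, its id is kept in the result instead of being dropped, which matches the requested output.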

Gavin Simpson answered Dec 28 '22 05:12


A nice and tidy data.table solution:

library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1       YI{Z0{ZI{
2             ZO{
3            <NA>
4             ZT{"), header=TRUE))

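# Make sure the column is character; strsplit() does not accept factors
DT[, interest_string := as.character(interest_string)]

# Split each id's string on the literal "{" and return one row per code
DT[, list(interest = unlist(strsplit(interest_string, "{", fixed = TRUE))), by = id]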

gives me

   id interest
1:  1       YI
2:  1       Z0
3:  1       ZI
4:  2       ZO
5:  3     <NA>
6:  4       ZT
Kevin Ushey answered Dec 28 '22 07:12
