Extract data elements found in a single column

Question

Here is what my data look like.

id interest_string
1       YI{Z0{ZI{
2             ZO{
3            <NA>
4             ZT{

As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string value at all.

How can I manipulate this data frame to extract the values into a format like this:

id  interest
1    YI
1    Z0
1    ZI
2    ZO
3    <NA>
4    ZT

I need to complete this task with R.

Thanks in advance.

Gavin Simpson · Accepted Answer

This is one solution:

out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))

out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
                        interest = unlist(out, use.names = FALSE))

Giving:

R> out
  id interest
1  1       YI
2  1       Z0
3  1       ZI
4  2       ZO
5  3     <NA>
6  4       ZT

Explanation

The first line of the solution splits each element of the interest_string factor in data object dat, using \\{ as the split pattern. The brace has to be escaped because the pattern is treated as a regular expression, and in an R string literal that requires two backslashes. (No escaping is needed if you use fixed = TRUE in the call to strsplit, since the pattern is then matched literally.) The resulting object is a list, which looks like this for the example data:

R> out
[[1]]
[1] "YI" "Z0" "ZI"

[[2]]
[1] "ZO"

[[3]]
[1] "<NA>"

[[4]]
[1] "ZT"
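As a quick check of the escaping point, the sketch below splits a literal vector with both forms of the pattern and compares the results. The vector x is a stand-in for as.character(dat$interest_string), not the original data, and it assumes row 3 is a true missing value.

```r
# Stand-in for as.character(dat$interest_string); row 3 is assumed to be a real NA.
x <- c("YI{Z0{ZI{", "ZO{", NA, "ZT{")

# Regex form: the brace is escaped, and the backslash itself is doubled
# inside the R string literal.
by_regex <- strsplit(x, "\\{")

# Fixed form: the pattern is matched literally, so no escaping is needed.
by_fixed <- strsplit(x, "{", fixed = TRUE)

# Both forms should produce the same list.
identical(by_regex, by_fixed)
```

Note that the trailing { on each string does not add an empty element: strsplit treats a match at the end of the string as if it had been removed.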

We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.

Hence, in the second line, we bind column-wise the original id values, each repeated the required number of times, to the unlisted strsplit result (out); calling cbind.data.frame directly ensures that a data frame is returned. Unlisting unwraps the list into a single vector of the required length, as in your expected output. The number of times to repeat each id value comes from the lengths of the components of the list returned by strsplit.
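The two steps can be run end to end on a small reconstruction of the example. This is only a sketch: dat here is a hypothetical rebuild of the question's data (with row 3 as a real NA and the column stored as a factor), and lengths(parts) is swapped in for sapply(out, length), which should give the same counts in R 3.2.0 or later.

```r
# Hypothetical reconstruction of the example data; the original dat is not shown.
dat <- data.frame(id = 1:4,
                  interest_string = c("YI{Z0{ZI{", "ZO{", NA, "ZT{"),
                  stringsAsFactors = TRUE)

# Step 1: split each string on the literal brace; the result is a list
# with one character vector per row of dat.
parts <- strsplit(as.character(dat$interest_string), "{", fixed = TRUE)

# Number of codes found in each row; this is how often each id is repeated.
n <- lengths(parts)

# Step 2: repeat each id to match its codes and flatten the list to a vector.
res <- data.frame(id = rep(dat$id, times = n),
                  interest = unlist(parts, use.names = FALSE))
res
```

The row with a missing interest_string still contributes one row to the result, because strsplit returns a length-one NA for it rather than an empty vector.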

Kevin Ushey · Answer

A nice and tidy data.table solution:

library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1       YI{Z0{ZI{
2             ZO{
3            <NA>
4             ZT{"), header=TRUE))

DT$interest_string <- as.character(DT$interest_string)

DT[, {
  list(interest=unlist(strsplit( interest_string, "{", fixed=TRUE )))
}, by=id]

gives me

   id interest
1:  1       YI
2:  1       Z0
3:  1       ZI
4:  2       ZO
5:  3     <NA>
6:  4       ZT

Tags: r, data-manipulation

Asked by Btibert3
