Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split strings into columns in R where each string has a potentially different number of column entries

Tags:

r

I've got a data frame that's got the following form

pages                         count
[page 1, page 2, page 3]      23
[page 2, page 4]              4
[page 1, page 3, page 4]      12

And what I need to do is split the first column at the commas and create enough new columns to cover the longest sequence. The result should be:

First Page      Second Page  Third Page     Count
page 1          page 2       page 3         23
page 2          page 4       null           4
page 1          page 3       page 4         12

I'm fine if the null is a zero-length string, and I can handle stripping off the brackets.

like image 271
TWAndrews Avatar asked Feb 24 '13 20:02

TWAndrews


2 Answers

My "splitstackshape" package has a function that addresses this kind of problem. The relevant function in this case is concat.split and works as follows (using "myDat" from Ricardo's answer):

# Get rid of "[" and "]" from your "pages" variable
myDat$pages <- gsub("\\[|\\]", "", myDat$pages)
# Specify the source data.frame, the variable that needs to be split up
#   and whether to drop the original variable or not
library(splitstackshape)
concat.split(myDat, "pages", ",", drop = TRUE)
#   count pages_1 pages_2 pages_3
# 1    23  page 1  page 2  page 3
# 2     4  page 2  page 4        
# 3    12  page 1  page 3  page 4
like image 76
A5C1D2H2I1M1N2O1R2T1 Avatar answered Oct 26 '22 23:10

A5C1D2H2I1M1N2O1R2T1


sample data

myDat <- read.table(text=
  "pages|count
[page 1, page 2, page 3]|23
[page 2, page 4]|4
[page 1, page 3, page 4]|12", header=TRUE, sep="|") 

We can pull pages out of myDat to work on it.

# if factors, convert to characters
pages <- as.character(myDat$page)

# remove brackets.  Note the double-escape's in R
pages <- gsub("(\\[|\\])", "", pages)

# split on comma
pages <- strsplit(pages, ",")

# find the largest element
maxLen <- max(sapply(pages, length))

# fill in any blanks. The t() is to transpose the return from sapply
pages <- 
t(sapply(pages, function(x)
      # append to x, NA's.  Note that if (0 == (maxLen - length(x))), then no NA's are appended 
      c(x, rep(NA, maxLen - length(x)))
  ))

# add column names as necessary
colnames(pages) <- paste(c("First", "Second", "Third"), "Page")

# Put it all back together
data.frame(pages, Count=myDat$count)



Results

> data.frame(pages, Count=myDat$count)
  First.Page Second.Page Third.Page Count
1     page 1      page 2     page 3    23
2     page 2      page 4       <NA>     4
3     page 1      page 3     page 4    12
like image 37
Ricardo Saporta Avatar answered Oct 26 '22 23:10

Ricardo Saporta