I am using RStudio 2.15.0 and have created an object from Excel using XLConnect with 3000+ rows and 12 columns I am trying to delimit/split a column into the rows but don't know if this is possible or how to do it. Example of the data below using the 3 columns in connection. any help on this would be grand.
Code that is working for 2 of the columns is below.
v1 <- with(df, tapply(PolId, Description, FUN= function(x) {
x1 <- paste(x, collapse=";")
gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*);', '', x1, perl=TRUE)}))
library(stringr)
Description <- rep(names(v1), str_count(v1, '\\w+'))
PolId <- scan(text=gsub(';+', ' ', v1), what='', quiet=TRUE)
data.frame(PolId, Description)
Sample data
PolId Description Document.Type
ABC123;ABC456;ABC789; TEST1 Pol1
ABC123;ABC456;ABC789; TEST1 Pol1
ABC123;ABC456;ABC789; TEST1 Pol1
AAA123; TEST1 End1
AAA123; TEST2 End2
ABB123;ABC123; TEST3 End1
ABB123;ABC123; TEST3 End1
I want the output to be like this (replacing the duplicate Polid's)
PolId Description Document.Type
ABC123 TEST1 Pol1
ABC456 TEST1 Pol1
ABC789 TEST1 Pol1
AAA123 TEST1 End1
AAA123 TEST2 End2
ABB123 TEST3 End1
ABC123 TEST3 End1
Here is a base R solution. Split the PolId
field using strplit
and for each such split field cbind it with the corresponding Description
. This gives a list of matrices which we rbind
together. Finally set the column names.
out <- do.call(rbind, Map(cbind, strsplit(DF$PolId, ";"), DF$Description))
colnames(out) <- colnames(DF)
giving:
> out
PolId Description
[1,] "ABC123" "TEST1"
[2,] "ABC456" "TEST1"
[3,] "ABC789" "TEST1"
[4,] "ABC123" "TEST1"
[5,] "ABC456" "TEST1"
[6,] "ABC789" "TEST1"
[7,] "ABC123" "TEST1"
[8,] "ABC456" "TEST1"
[9,] "ABC789" "TEST1"
[10,] "AAA123" "TEST1"
[11,] "AAA123" "TEST2"
[12,] "ABB123" "TEST3"
[13,] "ABC123" "TEST3"
[14,] "ABB123" "TEST3"
[15,] "ABC123" "TEST3"
Note: We used this as the input:
DF <-
structure(list(PolId = c("ABC123;ABC456;ABC789;", "ABC123;ABC456;ABC789;",
"ABC123;ABC456;ABC789;", "AAA123;", "AAA123;", "ABB123;ABC123;",
"ABB123;ABC123;"), Description = c("TEST1", "TEST1", "TEST1",
"TEST1", "TEST2", "TEST3", "TEST3")), .Names = c("PolId", "Description"
), class = "data.frame", row.names = c(NA, -7L))
Here's a quick data.table
possible solution
library(data.table)
unique(setDT(df)[, .(PolId = unlist(strsplit(as.character(PolId), ";"))), by = Description])
# Description PolId
# 1: TEST1 ABC123
# 2: TEST1 ABC456
# 3: TEST1 ABC789
# 4: TEST1 AAA123
# 5: TEST2 AAA123
# 6: TEST3 ABB123
# 7: TEST3 ABC123
Per your edit- Another option (in case you have more than two columns)
library(splitstackshape)
unique(cSplit(df, "PolId", ";", "long"))
# PolId Description Document.Type
# 1: ABC123 TEST1 Pol1
# 2: ABC456 TEST1 Pol1
# 3: ABC789 TEST1 Pol1
# 4: AAA123 TEST1 End1
# 5: AAA123 TEST2 End2
# 6: ABB123 TEST3 End1
# 7: ABC123 TEST3 End1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With