Consider the following hypothetical data:
x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. :
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)
Notice that the ":" appears at a different location in each string.
What I want to do is create two columns such that:
Wanted output for 'x':
Col1: There is a horror movie running in the iNox theater.
Col2: If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
Wanted output for 'y' (because ":" is not found within the first three sentences):
Col1: NA
Col2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
Just as for 'y', the wanted output for 'z' should be:
Col1: NA
Col2: all of the text from 'z'
What I am trying to do is:
resX <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[1]]),
Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[1]]))
resY <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[2]]),
Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[2]]))
resZ <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[3]]),
Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[3]]))
I then merge the results above into a single data frame "resDF" using rbind.
Issues are:
You can try this negative lookahead regex:
^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$
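To see how the lookahead behaves, here is a minimal sketch on short made-up strings (`a`, `b`, `c` and `d` are hypothetical and not part of the original data). The lookahead rejects any string in which three or more periods occur before the first colon. Note that it counts periods rather than sentences, so the period inside `row.names` counts as well.

```r
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

a <- "One. : Two. Three. Four."        # one period before the first colon
b <- "One. Two. Three. : Four."        # three periods before the first colon
c <- "No colon here."                  # no colon at all
d <- "See row.names. Done. : rest"     # two sentences, but three periods

res <- c(a = grepl(patt, a, perl = TRUE),
         b = grepl(patt, b, perl = TRUE),
         c = grepl(patt, c, perl = TRUE),
         d = grepl(patt, d, perl = TRUE))
print(res)
# a: TRUE, b: FALSE, c: FALSE, d: FALSE
```

If the rule really has to count sentences, the period test would need to be stricter (for example, a period followed by whitespace); that variation is an assumption and is not covered by the regex above.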
Updated:
If your condition is met, the regex matches and you get two groups:
group 1 contains the text up to the first ":" and group 2 contains the text after it.
If the condition is not met, copy the whole string to column 2 and put whatever you want in column 1.
The updated sample snippet below defines a function named processData that does this. If the condition is met, it splits the string and puts the parts in col1 and col2. If the condition is not met, as with y and z in your input, it puts NA in col1 and the entire string in col2.
Sample source:
library(stringr)
x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. :
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)
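# Empty result frame with the two target columns
resDF <- data.frame(Col1 = character(), Col2 = character(), stringsAsFactors = FALSE)

# Split one string at its first ":" unless three or more periods precede it
processData <- function(a) {
  patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
  if (grepl(patt, a, perl = TRUE)) {
    result <- str_match(a, patt)  # column 1 is the full match, 2 and 3 are the groups
    col1 <- result[2]
    col2 <- result[3]
  } else {
    col1 <- "NA"  # the literal string "NA"; use NA_character_ for a true missing value
    col2 <- a
  }
  c(col1, col2)
}

# Process each row and append the result to resDF
for (i in seq_len(nrow(df))) {
  resDF[nrow(resDF) + 1, ] <- processData(df$Text[i])
}
print(resDF)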
Sample Output:
Col1
1 There is a horror movie running in the iNox theater.
2 NA
3 NA
Col2
1 If row names are supplied of length one and the data \n frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data \n frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : \n If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n the row names and not a column (by name or number) Can we go : Please
3 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify \n the row names and not a column (by name or number) Can we go : Please
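The loop above can also be written without growing `resDF` row by row. The following is a minimal sketch, assuming base R only (`regexec` with `perl = TRUE`, available since R 3.3.0). The helper name `split_on_early_colon` and the vector `demo` are hypothetical. Unlike the snippet above, it returns a true `NA` rather than the string "NA" and trims the whitespace around the colon.

```r
# Hypothetical helper: vectorized split at the first ":" when fewer than
# three periods precede it; otherwise Col1 is NA and Col2 keeps the full text.
split_on_early_colon <- function(text) {
  patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
  # One list element per string: length 3 on a match, length 0 otherwise
  m <- regmatches(text, regexec(patt, text, perl = TRUE))
  hit <- lengths(m) == 3

  col1 <- rep(NA_character_, length(text))
  col2 <- text
  col1[hit] <- trimws(vapply(m[hit], `[`, character(1), 2))
  col2[hit] <- trimws(vapply(m[hit], `[`, character(1), 3))

  data.frame(Col1 = col1, Col2 = col2, stringsAsFactors = FALSE)
}

demo <- c("One. : Two. Three. Four.", "One. Two. Three. : Four.")
out <- split_on_early_colon(demo)
print(out)
```

With the data from the question, `split_on_early_colon(df$Text)` should produce one row per input string in a single call.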