
Error in regex pattern matching for text retrieval into two columns of a dataframe

Consider the below hypothetical data:

x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"


y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : 
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)

Notice that there is a ":" at a different location in each text. For example:

  • In 'x' it ( ":" ) is after the first sentence.
  • In 'y' it ( ":" ) is after the fourth sentence.
  • In 'z' it is after the sixth sentence.
  • Moreover, there is one more ":" before the last sentence in each text.

What I want to do is create two columns such that:

  • Only the first ":" is considered, NOT THE LAST ONE.
  • If there is a ":" within the first three sentences, the whole text is divided into two columns; otherwise, all the text is kept in the second column and 'NA' goes in the first column.

Wanted Output for 'x':

 Col1                                                        Col2 
 There is a horror movie running in the iNox theater.        If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Wanted Output for 'y' (NA in Col1, because ":" is not found within the first three sentences):

 Col1     Col2 
 NA       There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Just like the result for 'y' above, the Wanted Output for 'z' should be:

  Col1    Col2
  NA      all of the text from 'z'

What I am trying to do is:

resX <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[1]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[1]]))

resY <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[2]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[2]]))

resZ <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[3]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[3]]))

I then merge the above into a resulting dataframe "resDF" using rbind.

Issues are:

  • The above could be done with a for() loop or any other method to make the code simpler.
  • The results for the "y" and "z" texts are not coming out as I wanted (shown above).
Madhu Sareen, asked Jan 04 '23 10:01


1 Answer

You can try this regex with a negative lookahead, which rejects the string when three or more periods come before the first ":":

^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$

Regex Demo and Detailed explanation of the regex
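As a rough illustration of how the pattern behaves, here is a base-R sketch on made-up strings (the inputs and variable names are hypothetical, not from the question). It uses regexec() with perl = TRUE rather than stringr. Note that the lookahead counts literal periods, not sentences, so a name such as row.names contributes one period.

```r
# The pattern from the answer, written as an R string (backslashes doubled).
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

# Hypothetical short inputs, not taken from the question.
early <- "One. Two : rest of the text : tail"
late  <- "One. Two. Three. Four : rest of the text"
dots  <- "see row.names, col.names and dim.names : rest"

# regexec() returns the full match plus capture groups;
# regmatches() yields character(0) when there is no match.
m_early <- regmatches(early, regexec(patt, early, perl = TRUE))[[1]]
m_late  <- regmatches(late,  regexec(patt, late,  perl = TRUE))[[1]]
m_dots  <- regmatches(dots,  regexec(patt, dots,  perl = TRUE))[[1]]

m_early[2]      # group 1: text before the first ":"
m_early[3]      # group 2: everything after the first ":"
length(m_late)  # 0, since three periods precede the first ":"
length(m_dots)  # 0, since the periods inside the names are counted too
```

Because any "." counts, text containing abbreviations or dotted names may be rejected earlier than a sentence count would suggest.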

Updated:

If your condition is met, the regex matches and you get two parts: group 1 contains the text up to the first ":" and group 2 contains the text after it.

If the condition is not met, copy the whole string into column 2 and put whatever you want in column 1.

The updated sample snippet below defines a function named processData that does this. If the condition is met, it splits the text into Col1 and Col2; if it is not met, as for y and z in your input, it puts NA in Col1 and the entire text in Col2.

Sample source (run on ideone):

library(stringr)

    x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
    frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"


    y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
    frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : 
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"

    z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"             


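df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = FALSE)

# Empty result frame; rows are appended in the loop below.
resDF <- data.frame("Col1" = character(), "Col2" = character(), stringsAsFactors = FALSE)

# Split one string at its first ":" when fewer than three periods precede it;
# otherwise return "NA" for Col1 and the whole string for Col2.
processData <- function(a) {
    patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
    if (grepl(patt, a, perl = TRUE)) {
        result <- str_match(a, patt)   # columns: full match, group 1, group 2
        col1 <- result[2]
        col2 <- result[3]
    } else {
        col1 <- "NA"   # the literal string "NA", as printed in the sample output
        col2 <- a
    }
    return(c(col1, col2))
}

for (i in seq_len(nrow(df))) {
    tmp <- df[i, ]   # df has one column, so this is a length-one character vector
    resDF[nrow(resDF) + 1, ] <- processData(tmp)
}

print(resDF)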

Sample Output:

                                                   Col1
1 There is a horror movie running in the iNox theater. 
2                                                    NA
3                                                    NA
                                                                                                                                                                                                                                                                                                                                                                                                                              Col2
1                                                        If row names are supplied of length one and the data \n    frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data \n    frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : \n    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
3      There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
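Since the question also asks for something simpler than handling each row separately, here is a loop-free sketch in base R. The helper name split_on_first_colon and the demo strings are hypothetical. Unlike the snippet above, it returns a real missing value (NA_character_) instead of the string "NA" and trims surrounding whitespace; both are choices of this sketch, not part of the original answer.

```r
# Same pattern as in the answer.
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

# Hypothetical helper: `text` is a character vector, one element per row.
split_on_first_colon <- function(text) {
  # regexec() is vectorised over `text`; unmatched entries give character(0).
  m <- regmatches(text, regexec(patt, text, perl = TRUE))
  matched <- lengths(m) > 0

  col1 <- rep(NA_character_, length(text))  # default: missing
  col2 <- text                              # default: whole text

  col1[matched] <- vapply(m[matched], `[`, character(1), 2)
  col2[matched] <- vapply(m[matched], `[`, character(1), 3)

  data.frame(Col1 = trimws(col1), Col2 = trimws(col2),
             stringsAsFactors = FALSE)
}

# Made-up inputs: the first has one period before ":", the second has three.
demo <- c("One. Two : rest : tail", "One. Two. Three. Four : rest")
res <- split_on_first_colon(demo)
print(res)
```

With the data from the question, the call would presumably be split_on_first_colon(df$Text).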
Rizwan M.Tuman, answered Jan 13 '23 12:01