
How to extract lines above a match using regex and R?

Tags: regex, r

I would like to match a specific string in R and keep only the line directly above the match. Here is some example data; the real file has hundreds of similar cases:

first_case<- data.frame(line = 

             c("#John Wayne: Su, 11.01.2013 08:24:42#
                He is present / I guess, Does great job
                --------------------------------------------------
                #Michal Thorn: Fr, 12.09.2015 17:23:01#
                Works quite frequently with people
                --------------------------------------------------
                #Sandra Nunes: Mo, 20.05.2011 09:00:29#
                She has some new clients"))



second_case<- data.frame(line = 

                c("#Boris Jonson: Mo, 30.09.2017 09:20:42#
                He is present
                --------------------------------------------------
                #Jacky Fine: Th, 02.02.2013 18:23:01#
                Does great job
                --------------------------------------------------
                #Michael Bissping: Mo, 25.03.2012 10:00:29#
                Hard to count on"))



third_case<- data.frame(line = 

              c("#Isabelle Warren: Sa, 02.12.2013 02:24:42#
                 Not around / anymore
               --------------------------------------------------
                 #Tobias Maker: Mo, 02.03.2013 10:23:01#
                 Works quite frequently with people
               --------------------------------------------------
                 #Toe Michael : Mo, 20.05.2011 09:00:29#
                 She has some new clients & Does great job"))

all_cases <- rbind(first_case,second_case,third_case)

Here I try to keep only the lines that sit one line above the phrase:

Does great job

My idea was to look for a line ending in a newline that is directly followed by Does great job, and to take that line:

dplyr::filter(all_cases, grepl("((.*\n){1})Does great job",line))

Expected results:

first_case<- data.frame(line = 
                      c("#John Wayne: Su, 11.01.2013 08:24:42#"))
second_case<- data.frame(line = 
                       c("#Jacky Fine: Th, 02.02.2013 18:23:01#"))
third_case<- data.frame(line = 
                      c("#Toe Michael : Mo, 20.05.2011 09:00:29#"))

expected_result <- rbind(first_case,second_case,third_case)

1   #John Wayne: Su, 11.01.2013 08:24:42#
2   #Jacky Fine: Th, 02.02.2013 18:23:01#
3   #Toe Michael : Mo, 20.05.2011 09:00:29#

Unfortunately, the filter call above returns zero rows. I would appreciate any insights!

Googme asked Aug 02 '18 10:08



2 Answers

You could try:

library(stringr)
library(dplyr)

all_cases %>% transmute(x=str_extract(line,".*(?=\n.*?Does great job)"))

#                                                         x
#1                    #John Wayne: Su, 11.01.2013 08:24:42#
#2                    #Jacky Fine: Th, 02.02.2013 18:23:01#
#3                  #Toe Michael : Mo, 20.05.2011 09:00:29#
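The same lookahead idea can also be tried in base R. This is only a minimal sketch, assuming PCRE through perl = TRUE; the string x is a hypothetical, shortened copy of one record from the question. Because "." does not match a newline by default, ".*" captures exactly one line, and the lookahead (?=\n.*?Does great job) only succeeds when the very next line contains the phrase.

```r
# Hypothetical sample record, shortened from the question's data
x <- "#Boris Jonson: Mo, 30.09.2017 09:20:42#\n  He is present\n  ----------\n  #Jacky Fine: Th, 02.02.2013 18:23:01#\n  Does great job"

# ".*" stays on one line; the lookahead checks the following line
# for the phrase without including it in the match
m <- regexpr(".*(?=\n.*?Does great job)", x, perl = TRUE)

# regmatches() returns character(0) if there was no match;
# trimws() drops the indentation that the multi-line literal carries
above <- trimws(regmatches(x, m))
above
```

Like str_extract, regexpr only reports the first match in each string, so a record with several hits would need gregexpr or a split into smaller pieces first.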

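An improved solution, which handles each person's entry in each group of three independently:

library(tidyr)  # provides separate() and gather()

# split each record at the dashed separators into three columns
all_cases %>% separate(line,c("a","b","c"),sep="-{3,}") %>%
  # stack the three columns so that every entry gets its own row
  gather(k,v,a,b,c) %>%
  transmute(x=str_extract(v,".*(?=\n.*?Does great job)")) %>%
  # drop the entries that did not contain the phrase
  filter(!is.na(x))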
Nicolas2 answered Sep 29 '22 11:09



Here is one base R approach using strsplit. We can form a list/vector of lines, and then directly use grep to find the index of the line matching Does great job. Then, just return the line which immediately precedes that.

line <- "#Boris Jonson: Mo, 30.09.2017 09:20:42#
         He is present
         --------------------------------------------------
         #Jacky Fine: Th, 02.02.2013 18:23:01#
         Does great job
         --------------------------------------------------
         #Michael Bissping: Mo, 25.03.2012 10:00:29#
         Hard to count on"

terms <- unlist(strsplit(line, "\n"))
terms[grep("Does great job", terms) - 1]

[1] "         #Jacky Fine: Th, 02.02.2013 18:23:01#"
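To run the same logic over every record rather than a single string, one option is to wrap it in a small function and apply it to the line column. The following is a sketch under assumptions: all_cases here is a hypothetical, abbreviated stand-in for the question's data frame, and line_above is a made-up helper name.

```r
# Hypothetical stand-in for the question's all_cases (records abbreviated)
all_cases <- data.frame(
  line = c(
    "#John Wayne: Su, 11.01.2013 08:24:42#\n   He is present / I guess, Does great job\n   -----\n   #Michal Thorn: Fr, 12.09.2015 17:23:01#\n   Works quite frequently with people",
    "#Boris Jonson: Mo, 30.09.2017 09:20:42#\n   He is present\n   -----\n   #Jacky Fine: Th, 02.02.2013 18:23:01#\n   Does great job"
  )
)

# Same split-and-grep logic as above, wrapped in a function.
# as.character() guards against the column being a factor (the default in R < 4.0);
# trimws() removes the indentation left over from the multi-line literals.
line_above <- function(record, pattern = "Does great job") {
  terms <- trimws(unlist(strsplit(as.character(record), "\n")))
  terms[grep(pattern, terms) - 1]
}

result <- unlist(lapply(all_cases$line, line_above))
result
```

Each record contributes every line that sits above a match, so the result can be shorter or longer than the number of records.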


There are a number of edge cases which my answer does not cover, the first being the match logic. What should happen if the search term matches more than once, or not at all? Also, how specific should the pattern used in grep be?
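As an illustration of how those edge cases might be handled, here is one possible sketch. The helper name lines_above and the tiny test strings are hypothetical, and the choices made (return all matches, return an empty vector when nothing matches, treat the search term as literal text) are assumptions rather than requirements from the question.

```r
# Hypothetical helper that makes the edge-case behaviour explicit
lines_above <- function(record, phrase = "Does great job") {
  terms <- trimws(unlist(strsplit(as.character(record), "\n", fixed = TRUE)))
  idx <- grep(phrase, terms, fixed = TRUE)  # fixed = TRUE: literal text, not a regex
  idx <- idx[idx > 1]                       # a hit on the first line has nothing above it
  terms[idx - 1]                            # character(0) when no usable match remains
}

two_hits   <- lines_above("#A#\n  Does great job\n  ---\n  #B#\n  Does great job")
no_hit     <- lines_above("#A#\n  Hard to count on")
first_line <- lines_above("Does great job\n  #A#")
```

Here two_hits holds both preceding lines, while no_hit and first_line are empty character vectors, which a caller can detect with length().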

Tim Biegeleisen answered Sep 29 '22 11:09
