I would like to match a specific string using R and keep only the line above each match. Here is some example data; the real file contains hundreds of similar cases:
first_case<- data.frame(line =
c("#John Wayne: Su, 11.01.2013 08:24:42#
He is present / I guess, Does great job
--------------------------------------------------
#Michal Thorn: Fr, 12.09.2015 17:23:01#
Works quite frequently with people
--------------------------------------------------
#Sandra Nunes: Mo, 20.05.2011 09:00:29#
She has some new clients"))
second_case<- data.frame(line =
c("#Boris Jonson: Mo, 30.09.2017 09:20:42#
He is present
--------------------------------------------------
#Jacky Fine: Th, 02.02.2013 18:23:01#
Does great job
--------------------------------------------------
#Michael Bissping: Mo, 25.03.2012 10:00:29#
Hard to count on"))
third_case<- data.frame(line =
c("#Isabelle Warren: Sa, 02.12.2013 02:24:42#
Not around / anymore
--------------------------------------------------
#Tobias Maker: Mo, 02.03.2013 10:23:01#
Works quite frequently with people
--------------------------------------------------
#Toe Michael : Mo, 20.05.2011 09:00:29#
She has some new clients & Does great job"))
all_cases <- rbind(first_case,second_case,third_case)
Here I try to filter the lines which are one line above Does great job, by looking for Does great job preceded by a line ending in a newline and taking that line:
dplyr::filter(all_cases, grepl("((.*\n){1})Does great job",line))
Expected results:
first_case<- data.frame(line =
c("#John Wayne: Su, 11.01.2013 08:24:42#"))
second_case<- data.frame(line =
c("#Jacky Fine: Th, 02.02.2013 18:23:01#"))
third_case<- data.frame(line =
c("#Toe Michael : Mo, 20.05.2011 09:00:29#"))
expected_result <- rbind(first_case,second_case,third_case)
1 #John Wayne: Su, 11.01.2013 08:24:42#
2 #Jacky Fine: Th, 02.02.2013 18:23:01#
3 #Toe Michael : Mo, 20.05.2011 09:00:29#
Unfortunately, my filter attempt above returns zero rows. I would appreciate any insights!
You could try:
library(stringr)
library(dplyr)
all_cases %>% transmute(x=str_extract(line,".*(?=\n.*?Does great job)"))
# x
#1 #John Wayne: Su, 11.01.2013 08:24:42#
#2 #Jacky Fine: Th, 02.02.2013 18:23:01#
#3 #Toe Michael : Mo, 20.05.2011 09:00:29#
An improved solution, which handles each person's entry within each group of three independently:
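library(tidyr)  # separate() and gather() come from tidyr
all_cases %>% separate(line,c("a","b","c"),sep="-{3,}") %>%  # split each case into its three entries
  gather(k,v,a,b,c) %>%  # reshape so that each entry sits on its own row
  transmute(x=str_extract(v,".*(?=\n.*?Does great job)")) %>%
  filter(!is.na(x))  # drop entries without a match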
Here is one base R approach using strsplit. We can form a vector of lines, and then directly use grep to find the index of the line matching Does great job. Then, just return the line which immediately precedes that.
line <- "#Boris Jonson: Mo, 30.09.2017 09:20:42#
He is present
--------------------------------------------------
#Jacky Fine: Th, 02.02.2013 18:23:01#
Does great job
--------------------------------------------------
#Michael Bissping: Mo, 25.03.2012 10:00:29#
Hard to count on"
terms <- unlist(strsplit(line, "\n"))
terms[grep("Does great job", terms) - 1]
[1] " #Jacky Fine: Th, 02.02.2013 18:23:01#"
There are a number of edge cases which my answer does not cover, the first being the match logic. What should happen if the search term matches more than once, or not at all? Also, how specific should the pattern used in grep be?
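As a minimal sketch of one way those edge cases could be handled: the helper name line_above is hypothetical, and treating the search term as a fixed string, trimming surrounding whitespace, and returning every preceding line when there are several matches are all assumptions rather than part of the answer above.

```r
# Hypothetical helper (an assumption, not from the answer above): return the
# line directly above every line that contains `pattern`.
line_above <- function(text, pattern = "Does great job") {
  # as.character() guards against factor columns from older data.frame() defaults
  lines <- trimws(unlist(strsplit(as.character(text), "\n", fixed = TRUE)))
  hits <- grep(pattern, lines, fixed = TRUE)  # fixed = TRUE: literal match, no regex
  hits <- hits[hits > 1]                      # a match on line 1 has nothing above it
  lines[hits - 1]                             # character(0) when nothing matched
}

example <- paste(
  "#Boris Jonson: Mo, 30.09.2017 09:20:42#",
  "He is present",
  "--------------------------------------------------",
  "#Jacky Fine: Th, 02.02.2013 18:23:01#",
  "Does great job",
  sep = "\n"
)

line_above(example)          # "#Jacky Fine: Th, 02.02.2013 18:23:01#"
line_above("no match here")  # character(0)
```

To run it over a whole column, something like unlist(lapply(all_cases$line, line_above)) should work. Note that, unlike the lookahead pattern, this only counts a match when the search term is on the line directly below, whatever else that line contains.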