
Non-greedy gsub

Tags: regex, r, gsub

I have a dataset of logs:

V1  duration  id  startpoint
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  7771    1   2012-05-07_12-29-51
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7771    1   2012-05-07_12-29-51
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    7771    1   2012-05-07_12-29-51
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  7771    1   2012-05-07_12-29-51 211

I'm trying to extract info from the first column (timepoint, process, pid, url, etc.). At first I tried:

df$timepoint <- gsub("T<=>(.*)[=].*", "\\1", df$V1)

It returned something like 161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<, so then I tried:

df$timepoint <- gsub("T<=>([0-9]*).*", "\\1", df$V1)

That worked, but it won't work for text fields such as the process name, so I searched for 'regex minimal match' and found the term non-greedy. I tried again:

df$timepoint <- gsub("T<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$process <- gsub(".*P<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$pid <- gsub(".*I<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$url <- gsub(".*U<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$addr <- gsub(".*A<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$tab <- gsub(".*B<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$ver <- gsub(".*V<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$window <- gsub(".*W<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$name <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$company <- gsub(".*C<=>(.*?)", "\\1", df$V1)
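For reference, the difference between the first attempt and the non-greedy version can be reproduced on a single row. This is a minimal sketch: the variable names are made up for illustration, and perl = TRUE is passed to both calls so that the same regex engine handles both patterns.

```r
x <- "T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512"

# Unescaped, "[=]" is a character class matching a single "=", and "(.*)" is
# greedy, so the capture runs up to the last "=" in the string.
greedy <- sub("T<=>(.*)[=].*", "\\1", x, perl = TRUE)

# Escaping the brackets matches the literal "[=]" delimiter, and "(.*?)"
# stops at the first one.
lazy <- sub("T<=>(.*?)\\[=\\].*", "\\1", x, perl = TRUE)

print(greedy)
print(lazy)  # "161"
```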

Not every row contains every field, and that is where the problem occurred. If a row has no software name or company name, gsub finds no match and returns V1 unchanged, so the whole string is copied into the new variable. Likewise, if the version info is at the end of V1 (so no [=] follows it), the regex ".*V<=>(.*?)\\[=\\].*" fails to match and the whole string is copied as well:

V1  duration  id  startpoint  timepoint process pid url addr  tab ver window  name  company
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  7771    1   2012-05-07_12-29-51 161 explorer.exe    1820    T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  20094   T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7771    1   2012-05-07_12-29-51 195 360Safe.exe 1732    T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7, 5, 0, 1501   1017e   360安全卫士 360.cn
T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360极速浏览器[=]C<=>360.cn    7771    1   2012-05-07_12-29-51 203 360chrome.exe   436 NULL    2027a   20290   5.2.0.804   T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360极速浏览器[=]C<=>360.cn    360极速浏览器    360.cn
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    7771    1   2012-05-07_12-29-51 209 360Safe.exe 1732    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    1017e   T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  7771    1   2012-05-07_12-29-51 211 360chrome.exe   436 www.hao123.com  2027a   20290   T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804

I expected that if R can't find 'C<=>' (for example), there would be nothing for (.*?) to capture and the result would be an empty string, but the output was the whole string instead. Can anybody help me fix it? Thanks!
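The behaviour described here follows from how gsub works: it only rewrites the text that matches, so an element with no match comes back unchanged. The sketch below shows this on two made-up strings, then guards the result with grepl so that a missing field becomes NA. The variable names are illustrative only.

```r
x <- c("T<=>195[=]N<=>360[=]C<=>360.cn", "T<=>161[=]P<=>explorer.exe")

# gsub() only rewrites matched text; the second element has no "C<=>",
# so it is returned exactly as it went in.
raw <- gsub(".*C<=>(.*)", "\\1", x)

# Checking for the key first turns the no-match case into NA.
company <- ifelse(grepl("C<=>", x, fixed = TRUE), raw, NA_character_)

print(raw)
print(company)  # "360.cn" NA
```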

Update

Thanks to MrFlick's comment, I just got a solution based on this answer:

Taking the extraction of the software name as an example:

ind1 <- grep(".*N<=>(.*?)\\[=\\].*", df$V1, value = FALSE) # rows where the name is followed by another field
ind2 <- grep(".*N<=>(.*?)", df$V1, value = FALSE) # rows where the name exists at all
df$name <- ""
# subset V1 with the same index so the replacement has the right length and order
df$name[ind2] <- gsub(".*N<=>(.*?)", "\\1", df$V1[ind2]) # rows with the pattern
df$name[ind1] <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1[ind1]) # rows with the pattern and a follow-up field

But this snippet seems clumsy, and if it's the final solution I would have to repeat it for every other field (process, pid, version, company, etc.). Could someone help optimize it? Thanks!
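One way to avoid repeating the snippet per field is to wrap the extraction in a small function. The following is only a sketch: extract_field is a hypothetical helper, not an existing function, and it assumes each key is a single letter that occurs at most once per row. It uses regexpr with Perl lookarounds, so rows without the key yield NA instead of the whole string.

```r
# Hypothetical helper: return the value that follows "<key><=>", up to the
# next "[=]" or the end of the string; NA where the key is absent.
extract_field <- function(x, key) {
  pattern <- paste0("(?<=", key, "<=>).*?(?=\\[=\\]|$)")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)  # regmatches() returns matched elements only
  out
}

V1 <- c("T<=>161[=]P<=>explorer.exe[=]V<=>6.00.2900.5512",
        "T<=>195[=]P<=>360Safe.exe[=]V<=>7, 5, 0, 1501[=]C<=>360.cn")

# Illustrative mapping from new column names to keys.
keys <- c(timepoint = "T", process = "P", ver = "V", company = "C")
fields <- as.data.frame(lapply(keys, function(k) extract_field(V1, k)),
                        stringsAsFactors = FALSE)
print(fields)
```

The result can then be attached to the original data with cbind(df, fields), assuming the rows are in the same order.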

Asked by leoce, Oct 20 '22 07:10


1 Answer

Here's another strategy. We can use gregexpr to separate each of the pieces of the stacked data. Here's the data in a vector:

V1<-c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512", 
"T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn", 
"T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501", 
"T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")

Now we can split the pieces with

m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", V1, perl = TRUE)
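To see what this pattern picks up, the whole matches can be inspected with the built-in regmatches. A small sketch on one row (the names x, m1 and pieces are illustrative): each piece still contains the key, the "<=>" separator and the trailing "[=]" delimiter, because regmatches returns full matches rather than the captured groups.

```r
x <- "T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512"
m1 <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", x, perl = TRUE)

# One character vector per input string; take the first (and only) one.
pieces <- regmatches(x, m1)[[1]]
print(pieces)
```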

Getting the captured matches out can be messy, so I use the helper function regcapturedmatches to get at all the matched data easily. It is not part of base R, so it has to be defined or sourced first. I use it like you would use the built-in regmatches:

data <- regcapturedmatches(V1,m)

If you inspect data, you can see that all the info is there. The remaining problem is that each key/value pair is a row, whereas we want the keys as columns. To do that I use reshape2:

library(reshape2)

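# combine the list into one data.frame, tagging each piece with its source row S
sdata <- do.call(rbind, lapply(seq_along(data),
    function(i) data.frame(data[[i]], S = i)))

# turn rows into columns: one row per S, one column per key (X1), values from X2
dcast(sdata, S ~ X1, value.var = "X2")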

And that returns

  S    I             P   T              V     W      C           N     A     B
1 1 1820  explorer.exe 161 6.00.2900.5512 20094   <NA>        <NA>  <NA>  <NA>
2 2 1732   360Safe.exe 195  7, 5, 0, 1501 1017e 360.cn 360安全卫士  <NA>  <NA>
3 3 1732   360Safe.exe 209  7, 5, 0, 1501 1017e   <NA>        <NA>  <NA>  <NA>
4 4  436 360chrome.exe 211      5.2.0.804  <NA>   <NA>        <NA> 2027a 20290
               U
1           <NA>
2           <NA>
3           <NA>
4 www.hao123.com

You can rename columns and such, but it's really not all that much code to do all the transformations at once.
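If neither regcapturedmatches nor reshape2 is available, a similar wide table can be put together with base R only. This is a swapped-in alternative rather than the approach above: it splits on the literal "[=]" delimiter instead of using gregexpr, and parse_row is a hypothetical helper name. It assumes every piece has the form key<=>value.

```r
V1 <- c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
        "T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")

# Hypothetical helper: split one log string into a named character vector.
parse_row <- function(s) {
  parts <- strsplit(s, "[=]", fixed = TRUE)[[1]]      # "T<=>161", "P<=>explorer.exe", ...
  keys  <- sub("<=>.*$", "", parts)                   # text before the first "<=>"
  vals  <- sub("^.*?<=>", "", parts, perl = TRUE)     # text after the first "<=>"
  setNames(vals, keys)
}

rows <- lapply(V1, parse_row)
keys <- unique(unlist(lapply(rows, names)))           # all keys, in order of appearance

# Indexing by name yields NA for keys a row does not have.
wide <- as.data.frame(do.call(rbind, lapply(rows, function(r) unname(r[keys]))),
                      stringsAsFactors = FALSE)
names(wide) <- keys
print(wide)
```

Unlike the dcast output, the columns here come out in order of first appearance (T, P, I, W, V, U, A, B) and there is no S column; the row order matches V1.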

Answered by MrFlick, Oct 23 '22 01:10