I have a dataset of logs:
V1 duration id startpoint
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 7771 1 2012-05-07_12-29-51
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7771 1 2012-05-07_12-29-51
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 7771 1 2012-05-07_12-29-51
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 7771 1 2012-05-07_12-29-51
I'm trying to extract info from the first column (timepoint, process, pid, url, etc.). At first I tried:
df$timepoint <- gsub("T<=>(.*)[=].*", "\\1", df$V1)
It returned something like 161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<, so then I tried:
df$timepoint <- gsub("T<=>([0-9]*).*", "\\1", df$V1)
It worked, but it won't work for text fields like the process name, so I searched for 'regex minimal match' and found the term non-greedy. I tried again:
df$timepoint <- gsub("T<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$process <- gsub(".*P<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$pid <- gsub(".*I<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$url <- gsub(".*U<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$addr <- gsub(".*A<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$tab <- gsub(".*B<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$ver <- gsub(".*V<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$window <- gsub(".*W<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$name <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$company <- gsub(".*C<=>(.*?)", "\\1", df$V1)
Not every row contains all the info, and that is where the problem occurred. If there's no info about the software name or the company name, R simply copies V1 into the new variable. If the software version info is at the end of V1, then the regex ".*V<=>(.*?)\\[=\\].*" also copies the whole string into the new variable:
V1 duration id startpoint timepoint process pid url addr tab ver window name company
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 7771 1 2012-05-07_12-29-51 161 explorer.exe 1820 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 20094 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7771 1 2012-05-07_12-29-51 195 360Safe.exe 1732 T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7, 5, 0, 1501 1017e 360安全卫士 360.cn
T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360极速浏览器[=]C<=>360.cn 7771 1 2012-05-07_12-29-51 203 360chrome.exe 436 NULL 2027a 20290 5.2.0.804 T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360极速浏览器[=]C<=>360.cn 360极速浏览器 360.cn
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 7771 1 2012-05-07_12-29-51 209 360Safe.exe 1732 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 1017e T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 7771 1 2012-05-07_12-29-51 211 360chrome.exe 436 www.hao123.com 2027a 20290 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804
I thought that if R can't find 'C<=>' (for example), there would be nothing for (.*?) to capture and the result would be an empty string, but the output took the whole string instead. Can anybody help me fix it? Thanks!
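For what it's worth, the behaviour described above is how gsub works: when the pattern does not match, nothing is replaced and the input comes back unchanged. A minimal sketch of one possible workaround follows, assuming the key letters only ever appear directly before "<=>". The helper name extract_field is hypothetical, not something from the question; it checks for the key first and returns NA when it is absent, and it accepts either "[=]" or the end of the string as the terminator.

```r
# Two sample rows copied from the question
V1 <- c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
        "T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")

# The first row has no "U<=>", so the pattern fails to match and
# gsub() returns the input untouched instead of an empty string
unchanged <- gsub(".*U<=>(.*?)\\[=\\].*", "\\1", V1[1])
identical(unchanged, V1[1])

# extract_field() is a hypothetical helper: NA when the key is missing,
# otherwise the text between "key<=>" and the next "[=]" or end of string
extract_field <- function(x, key) {
  pattern <- paste0(".*", key, "<=>(.*?)(\\[=\\]|$).*")
  ifelse(grepl(paste0(key, "<=>"), x, fixed = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

url <- extract_field(V1, "U")
ver <- extract_field(V1, "V")
```

Here url should be NA for the first row, and ver should be filled for both rows even though the version sits at the very end of the string.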
Thanks to MrFlick's comment, I just got a solution based on this answer:
Take the process of extracting software name info as an example,
ind1 <- grep(".*N<=>(.*?)\\[=\\].*", df$V1, value = FALSE) # rows where the value is followed by [=]
ind2 <- grep(".*N<=>(.*?)", df$V1, value = FALSE) # rows where the key exists at all
df$name <- ""
df$name[ind2] <- gsub(".*N<=>(.*?)", "\\1", df$V1[ind2]) # key present: keep whatever follows N<=>
df$name[ind1] <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1[ind1]) # value followed by [=]: keep only the value
But this snippet seems clumsy, and if it's the final solution I have to repeat it for all the other info (process, pid, version, company, etc.). Could someone help optimize it? Thanks!
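One way to avoid repeating the snippet per field might be a loop over a name-to-key mapping. The sketch below is only an illustration under some assumptions: the mapping is read off the gsub calls in the question, every key is a single letter sitting either at the start of the string or right after "[=]", and regexec is called with perl = TRUE (available in R 3.3.0 and later). Missing fields come out as NA rather than "".

```r
# Toy data frame with two rows from the question; the column name V1 follows the question
df <- data.frame(
  V1 = c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
         "T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn"),
  stringsAsFactors = FALSE
)

# New column name -> log key, taken from the gsub() calls in the question
keys <- c(timepoint = "T", process = "P", pid = "I", url = "U", addr = "A",
          tab = "B", ver = "V", window = "W", name = "N", company = "C")

for (col in names(keys)) {
  # The key must start the string or follow "[=]";
  # the value runs lazily up to the next "[=]" or the end of the string
  pattern <- paste0("(?:^|\\[=\\])", keys[[col]], "<=>(.*?)(?:\\[=\\]|$)")
  hits <- regmatches(df$V1, regexec(pattern, df$V1, perl = TRUE))
  # Each hit is c(full match, captured value); no match gives character(0)
  df[[col]] <- vapply(hits,
                      function(h) if (length(h) == 2) h[2] else NA_character_,
                      character(1))
}
```

All extracted columns are character, so fields such as pid would still need as.integer() if numbers are wanted.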
Here's another strategy. We can use gregexpr to separate each of the pieces of the stacked data. Here's the data in a vector:
V1<-c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
"T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn",
"T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501",
"T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")
Now we can split the pieces with
m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", V1, perl=TRUE)
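As a quick sanity check, the full matches (without the capture groups) can be listed with base R's regmatches. This is just a sketch on a single row from the question; the variable names x and pieces are made up for the illustration.

```r
# One row from the question
x <- "T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512"

# Same pattern as above: one word character as key, "<=>", then a lazy value
# that stops at the next "[=]" or at the end of the string
m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", x, perl = TRUE)

# regmatches() returns whole matches only, one character vector per input string
pieces <- regmatches(x, m)[[1]]
print(pieces)
```

Each element of pieces is one key/value chunk, including the trailing "[=]" where there is one.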
Getting out captured matches can be a mess, but I use the function regcapturedmatches to easily get at all the matched data. It is a helper function rather than part of base R, so it has to be defined or sourced first. I use it like you would use the built-in regmatches:
data <- regcapturedmatches(V1,m)
Then if you inspect data you can see all the info is there. Now the problem is we just need to build it up as columns rather than rows as it is now. To do that I use reshape2:
library(reshape2)
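# combine the list into one data.frame, tagging each piece with its source row S
sdata <- do.call(rbind, lapply(seq_along(data),
    function(i) data.frame(data[[i]], S=i)))
# turn rows into columns: one row per S, one column per key (X1), filled with the value (X2)
dcast(sdata, S~X1, value.var="X2")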
And that returns
  S    I             P   T              V     W      C          N     A     B              U
1 1 1820  explorer.exe 161 6.00.2900.5512 20094   <NA>       <NA>  <NA>  <NA>           <NA>
2 2 1732   360Safe.exe 195  7, 5, 0, 1501 1017e 360.cn 360安全卫士  <NA>  <NA>           <NA>
3 3 1732   360Safe.exe 209  7, 5, 0, 1501 1017e   <NA>       <NA>  <NA>  <NA>           <NA>
4 4  436 360chrome.exe 211      5.2.0.804  <NA>   <NA>       <NA> 2027a 20290 www.hao123.com
You can rename columns and such, but it's really not all that much code to do all the transformations at once.
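If pulling in a helper function and reshape2 is not an option, roughly the same table can probably be built with base R alone by splitting on the literal separators. This is a swapped-in technique (strsplit instead of gregexpr), sketched under the assumption that every piece looks like key<=>value with a non-empty value and that values never contain "[=]" or "<=>". The names parsed, all_keys and wide are made up for the illustration.

```r
# Two rows from the question
V1 <- c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
        "T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")

# Split each row at the literal "[=]", then each piece at the literal "<=>"
parsed <- lapply(strsplit(V1, "[=]", fixed = TRUE), function(p) {
  kv <- do.call(rbind, strsplit(p, "<=>", fixed = TRUE))  # column 1 = key, column 2 = value
  setNames(kv[, 2], kv[, 1])                              # named vector: key -> value
})

# Every key seen in any row, in order of first appearance
all_keys <- unique(unlist(lapply(parsed, names)))

# Look up each key per row; keys absent from a row become NA
mat <- t(vapply(parsed,
                function(r) unname(r[all_keys]),
                character(length(all_keys))))
colnames(mat) <- all_keys
wide <- as.data.frame(mat, stringsAsFactors = FALSE)
```

The result has one row per log line and one character column per key, with NA where a key is missing.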