how do you parse a sysmon file to extract certain information using R?

Question

I am trying to read bunch of this type of files using R to parse out the information and put the data in a data frame like format:

this is the contents of the file:

    last_run                        current_run                     seconds     
 ------------------------------- ------------------------------- ----------- 
             Jul  4 2016  7:17AM             Jul  4 2016  7:21AM         226 


Engine Utilization (Tick %)   User Busy   System Busy    I/O Busy        Idle
  -------------------------  ------------  ------------  ----------  ---------- 
  ThreadPool : syb_default_pool                                                 
   Engine 0                         5.0 %         0.4 %      22.4 %      72.1 % 
   Engine 1                         3.9 %         0.5 %      22.8 %      72.8 % 
   Engine 2                         5.6 %         0.3 %      22.5 %      71.6 % 
   Engine 3                         5.1 %         0.4 %      22.7 %      71.8 % 

     -------------------------  ------------  ------------  ----------  ---------- 
  Pool Summary        Total       336.1 %        25.6 %    1834.6 %    5803.8 % 
                    Average         4.2 %         0.3 %      22.9 %      72.5 % 

  -------------------------  ------------  ------------  ----------  ---------- 
  Server Summary      Total       336.1 %        25.6 %    1834.6 %    5803.8 % 
                    Average         4.2 %         0.3 %      22.9 %      72.5 % 

Transaction Profile
-------------------

  Transaction Summary             per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Committed Xacts                 137.3           n/a       41198     n/a     

     Average Runnable Tasks            1 min         5 min      15 min  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  ThreadPool : syb_default_pool                                                 
   Global Queue                       0.0           0.0         0.0       0.0 %
   Engine 0                           0.0           0.1         0.1       0.6 %
   Engine 1                           0.0           0.0         0.0       0.0 %
   Engine 2                           0.2           0.1         0.1       2.6 %

  -------------------------  ------------  ------------  ----------             
  Pool Summary        Total           7.2           5.9         6.1             
                    Average           0.1           0.1         0.1             

  -------------------------  ------------  ------------  ----------             
  Server Summary      Total           7.2           5.9         6.1             
                    Average           0.1           0.1         0.1 

Device Activity Detail
  ----------------------

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_125                                         
    datadev_125                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       n/a   
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       0.0 %


  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_126                                         
    datadev_126                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       n/a   
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          0.0           0.0           0       0.0 %


  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /dev/vx/rdsk/sybaserdatadg/datadev_127                                         
    datadev_127                   per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Reads                                                                       
      APF                             0.0           0.0           5       0.4 %
      Non-APF                         0.0           0.0           1       0.1 %
    Writes                            3.8           0.0        1128      99.5 %
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                          3.8           0.0        1134       0.1 %

  Mirror Semaphore Granted            3.8           0.0        1134     100.0 %
  Mirror Semaphore Waited             0.0           0.0           0       0.0 %

  ----------------------------------------------------------------------------- 

  Device:                                                                       
    /sybaser/database/sybaseR/dev/sybaseR.datadev_000                                    
    GPS_datadev_000               per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ---------- 
    Reads                                                                       
      APF                             7.9           0.0        2372      55.9 %
      Non-APF                         5.5           0.0        1635      38.6 %
    Writes                            0.8           0.0         233       5.5 %
  -------------------------  ------------  ------------  ----------  ---------- 
  Total I/Os                         14.1           0.0        4240       0.3 %

  Mirror Semaphore Granted           14.1           0.0        4239     100.0 %
  Mirror Semaphore Waited             0.0           0.0           2       0.0 %

I need to capture "Jul 4 2016 7:21AM" as Date, from "Engine Utilization (Tick%) line, Server Summary ->Average "4.2%"

From "Transaction Profile" section ->Transaction Profile "count" entry.

so, my data frame should look something like this:

Date                     Cpu   Count
Jul  4 2016  7:21AM      4.2   41198

Can somebody help me how to parse this file to get these output?

I have tried something like this:

read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])

to get this line:

Average         4.2 %         0.3 %      22.9 %      72.5 %

But I want to be able to only extract Average right after

Engine Utilization (Tick %), since there could be many lines that start with Average. The Average line that shows up right after Engine Utilization (Tick %), is the one I want.

How do I put that in this line to extract this information from this file:

read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])

Can I use grep in this read.table line to search for certain characters?

renato vitolo · Accepted Answer

%%%% Shot 1 -- got something working

extract <- function(filenam="file.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]

    data.frame(Date=date_current_run, Cpu=Cpu, Count=count, stringsAsFactors=FALSE)
}

print(extract("file.txt"))

##file.list <- dir("./")
file.list <- rep("file.txt",3)
merged <- do.call("rbind", lapply(file.list, extract))

print(merged)

file.list <- rep("file.txt",2000)
print(system.time(merged <- do.call("rbind", lapply(file.list, extract))))
## runs in about 2.5 secs on my laptop

%%% Shot 2: 1st attempt to extract a (potentially variable) number of device columns

extractv2 <- function(filenam="file2.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]


    ## Total I/Os
    ## 1. Each line "Device:" is assumed to be the header of a block of lines
    ##    containing info about a single device (there are 4 such blocks
    ##    in your example);
    ## 2. each block is assumed to contain one or more lines matching
    ##    "Total I/Os";
    ## 3. the relevant count data is assumed to be contained in the last
    ##    of such lines (at column 4), for each block.
    ## Approach: loop on the line numbers of those lines matching "Device:"
    ## to get: A. counts; B. device names
    ii_block_dev <- grep("Device:", txt, fixed=TRUE)
    ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
    nblocks <- length(ii_block_dev)
    ## A. get counts for each device
    ## for each block, select *last* line matching "Total I/Os"
    ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
    ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
        ## select matching liens to "Total I/Os" within each block
        IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block  ] &
                                       ii_lines_IOs < ii_block_dev_aux[block+1]
                                   ]
        tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
    })
    lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
        Filter(function(v) v!="", strsplit(strng," ")[[1]])
    })
    IOs_counts <- sapply(lines_IOs, function(v) v[5])
    ## B. get device names:
    ## assumed to be on lines following each "Device:" match
    ii_devices <- 1 + ii_block_dev
    device_names <- sapply(ii_devices, function(ii){
        Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    })
    ## Create a data.frame with "device_names" as column names and "IOs_counts" as
    ## the values of a single row.
    ## Sorting the device names by order() will help produce the same column names
    ## if different sysmon files list the devices in different order
    ord <- order(device_names)
    devices <- as.data.frame(structure(as.list(IOs_counts[ord]), names=device_names[ord]),
                             check.names=FALSE) ## Prevent R from messing with our device names

    data.frame(stringsAsFactors=FALSE, check.names=FALSE,
               Date=date_current_run, Cpu=Cpu, Count=count, devices)
}
print(extractv2("file2.txt"))


## WATCH OUT:
## merging will ONLY work if all devices have the same names across sysmon files!!
file.list <- rep("file2.txt",3)
merged <- do.call("rbind", lapply(file.list, extractv2))
print(merged)

%%%%%%% Shot 3: extract two tables, one with a single row, and a second with a variable number of rows (depending on the which devices are listed in each sysmon file).

extractv3 <- function(filenam="file2.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8], collapse=" ")


    ## Cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
    ii <- grep("Server Summary",txt, fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    Cpu <- line_Cpu[2]


    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]

    ## first part of output: fixed three-column structure
    fixed <-  data.frame(stringsAsFactors=FALSE,
                         Date=date_current_run, Cpu=Cpu, Count=count)

    ## Total I/Os
    ## 1. Each line "Device:" is assumed to be the header of a block of lines
    ##    containing info about a single device (there are 4 such blocks
    ##    in your example);
    ## 2. each block is assumed to contain one or more lines matching
    ##    "Total I/Os";
    ## 3. the relevant count data is assumed to be contained in the last
    ##    of such lines (at column 4), for each block.
    ## Approach: loop on the line numbers of those lines matching "Device:"
    ## to get: A. counts; B. device names
    ii_block_dev <- grep("Device:", txt, fixed=TRUE)
    if(length(ii_block_dev)==0){
        variable <- data.frame(stringsAsFactors=FALSE,
                               date_current_run=date_current_run,
                               device_names=NA, IOs_counts=NA)
    }else{
        ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
        nblocks <- length(ii_block_dev)
        if(length(ii_block_dev)==0){
            sprintf("WEIRD datapoint at date %s: I have %d devices but 0 I/O lines??")
            ##stop()
        }
        ## A. get counts for each device
        ## for each block, select *last* line matching "Total I/Os"
        ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
        ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
            ## select matching lines to "Total I/Os" within each block
            IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block  ] &
                                           ii_lines_IOs < ii_block_dev_aux[block+1]
                                          ]
            tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
        })
        lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
            Filter(function(v) v!="", strsplit(strng," ")[[1]])
        })
        IOs_counts <- sapply(lines_IOs, function(v) v[5])
        ## B. get device names:
        ## assumed to be on lines following each "Device:" match
        ii_devices <- 1 + ii_block_dev
        device_names <- sapply(ii_devices, function(ii){
            Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
        })
        ## Create a data.frame with three columns: date, device, counts
        variable <- data.frame(stringsAsFactors=FALSE,
                               date_current_run=rep(date_current_run, length(IOs_counts)),
                               device_names=device_names, IOs_counts=IOs_counts)
    }
    list(fixed=fixed, variable=variable)
}
print(extractv3("file2.txt"))


file.list <- c("file.txt","file2.txt","file3.txt")
res <- lapply(file.list, extractv3)

fixed.merged <- do.call("rbind", lapply(res, function(r) r$fixed))
print(fixed.merged)

variable.merged <- do.call("rbind", lapply(res, function(r) r$variable))
print(variable.merged)

dww · Answer

Manipulating text files can sometimes be easier using dedicated programs. E.g. gawk is specifically designed for finding patterns in text files and outputting data from them. We can use a short gawk script to get the required data to load into R. Note, each line of the script consists of a pattern to look for, followed by an action to take enclosed in{}. NR is a counter that counts number of lines read so far.

BEGIN                          {OFS = ""; ORS = ""}
/current_run/                  {dat_line = NR+2; cpu_done = 0}
/Server Summary/               {cpu_line = NR+1}
/Transaction Summary/          {cnt_line = NR+2}
NR == dat_line                 {print "'",$5," ",$6," ",$7," ",$8,"' "}
NR == cpu_line && cpu_done==0  {print $2," "; cpu_done = 1}
NR == cnt_line                 {print $5,"
"}

Save this script with the name "ext.awk", then extract all the data files into an R data frame (assuming they are all located in one folder and have the extension .txt) with

df <- read.table(text=system("gawk -f ext.awk *.txt", T), col.names = c("Date","Cpu","Count"))

NB, gawk comes ready installed on most Linux versions. On windows you may need to install it from http://gnuwin32.sourceforge.net/packages/gawk.htm

how do you parse a sysmon file to extract certain information using R?

Tags:

regex

parsing

r

pcre

user1471980

2 Answers

renato vitolo

dww

Recent Activity

Donate For Us

how do you parse a sysmon file to extract certain information using R?

Tags:

regex

parsing

r

pcre

user1471980

2 Answers

renato vitolo

dww

Related questions

Recent Activity

Donate For Us