I am trying to read bunch of this type of files using R to parse out the information and put the data in a data frame like format:
this is the contents of the file:
last_run current_run seconds
------------------------------- ------------------------------- -----------
Jul 4 2016 7:17AM Jul 4 2016 7:21AM 226
Engine Utilization (Tick %) User Busy System Busy I/O Busy Idle
------------------------- ------------ ------------ ---------- ----------
ThreadPool : syb_default_pool
Engine 0 5.0 % 0.4 % 22.4 % 72.1 %
Engine 1 3.9 % 0.5 % 22.8 % 72.8 %
Engine 2 5.6 % 0.3 % 22.5 % 71.6 %
Engine 3 5.1 % 0.4 % 22.7 % 71.8 %
------------------------- ------------ ------------ ---------- ----------
Pool Summary Total 336.1 % 25.6 % 1834.6 % 5803.8 %
Average 4.2 % 0.3 % 22.9 % 72.5 %
------------------------- ------------ ------------ ---------- ----------
Server Summary Total 336.1 % 25.6 % 1834.6 % 5803.8 %
Average 4.2 % 0.3 % 22.9 % 72.5 %
Transaction Profile
-------------------
Transaction Summary per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Committed Xacts 137.3 n/a 41198 n/a
Average Runnable Tasks 1 min 5 min 15 min % of total
------------------------- ------------ ------------ ---------- ----------
ThreadPool : syb_default_pool
Global Queue 0.0 0.0 0.0 0.0 %
Engine 0 0.0 0.1 0.1 0.6 %
Engine 1 0.0 0.0 0.0 0.0 %
Engine 2 0.2 0.1 0.1 2.6 %
------------------------- ------------ ------------ ----------
Pool Summary Total 7.2 5.9 6.1
Average 0.1 0.1 0.1
------------------------- ------------ ------------ ----------
Server Summary Total 7.2 5.9 6.1
Average 0.1 0.1 0.1
Device Activity Detail
----------------------
Device:
/dev/vx/rdsk/sybaserdatadg/datadev_125
datadev_125 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 n/a
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 0.0 %
-----------------------------------------------------------------------------
Device:
/dev/vx/rdsk/sybaserdatadg/datadev_126
datadev_126 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 n/a
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 0.0 %
-----------------------------------------------------------------------------
Device:
/dev/vx/rdsk/sybaserdatadg/datadev_127
datadev_127 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Reads
APF 0.0 0.0 5 0.4 %
Non-APF 0.0 0.0 1 0.1 %
Writes 3.8 0.0 1128 99.5 %
------------------------- ------------ ------------ ---------- ----------
Total I/Os 3.8 0.0 1134 0.1 %
Mirror Semaphore Granted 3.8 0.0 1134 100.0 %
Mirror Semaphore Waited 0.0 0.0 0 0.0 %
-----------------------------------------------------------------------------
Device:
/sybaser/database/sybaseR/dev/sybaseR.datadev_000
GPS_datadev_000 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Reads
APF 7.9 0.0 2372 55.9 %
Non-APF 5.5 0.0 1635 38.6 %
Writes 0.8 0.0 233 5.5 %
------------------------- ------------ ------------ ---------- ----------
Total I/Os 14.1 0.0 4240 0.3 %
Mirror Semaphore Granted 14.1 0.0 4239 100.0 %
Mirror Semaphore Waited 0.0 0.0 2 0.0 %
I need to capture "Jul 4 2016 7:21AM" as Date, from "Engine Utilization (Tick%) line, Server Summary ->Average "4.2%"
From "Transaction Profile" section ->Transaction Profile "count" entry.
so, my data frame should look something like this:
Date Cpu Count
Jul 4 2016 7:21AM 4.2 41198
Can somebody help me how to parse this file to get these output?
I have tried something like this:
read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])
to get this line:
Average 4.2 % 0.3 % 22.9 % 72.5 %
But I want to be able to only extract Average right after
Engine Utilization (Tick %), since there could be many lines that start with Average. The Average line that shows up right after Engine Utilization (Tick %), is the one I want.
How do I put that in this line to extract this information from this file:
read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])
Can I use grep in this read.table line to search for certain characters?
%%%% Shot 1 -- got something working
extract <- function(filenam="file.txt"){
txt <- readLines(filenam)
## date of current run:
## assumed to be on 2nd line following the first line matching "current_run"
ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
date_current_run <- paste(line_current_run[5:8], collapse=" ")
## Cpu:
## assumed to be on line following the first line matching "Server Summary"
## which comes after the first line matching "Engine Utilization ..."
jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
ii <- grep("Server Summary",txt, fixed=TRUE)
ii <- 1 + min(ii[ii>jj])
line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
Cpu <- line_Cpu[2]
## Count:
## assumed to be on 2nd line following the first line matching "Transaction Summary"
ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
count <- line_count[5]
data.frame(Date=date_current_run, Cpu=Cpu, Count=count, stringsAsFactors=FALSE)
}
print(extract("file.txt"))
##file.list <- dir("./")
file.list <- rep("file.txt",3)
merged <- do.call("rbind", lapply(file.list, extract))
print(merged)
file.list <- rep("file.txt",2000)
print(system.time(merged <- do.call("rbind", lapply(file.list, extract))))
## runs in about 2.5 secs on my laptop
%%% Shot 2: 1st attempt to extract a (potentially variable) number of device columns
extractv2 <- function(filenam="file2.txt"){
txt <- readLines(filenam)
## date of current run:
## assumed to be on 2nd line following the first line matching "current_run"
ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
date_current_run <- paste(line_current_run[5:8], collapse=" ")
## Cpu:
## assumed to be on line following the first line matching "Server Summary"
## which comes after the first line matching "Engine Utilization ..."
jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
ii <- grep("Server Summary",txt, fixed=TRUE)
ii <- 1 + min(ii[ii>jj])
line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
Cpu <- line_Cpu[2]
## Count:
## assumed to be on 2nd line following the first line matching "Transaction Summary"
ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
count <- line_count[5]
## Total I/Os
## 1. Each line "Device:" is assumed to be the header of a block of lines
## containing info about a single device (there are 4 such blocks
## in your example);
## 2. each block is assumed to contain one or more lines matching
## "Total I/Os";
## 3. the relevant count data is assumed to be contained in the last
## of such lines (at column 4), for each block.
## Approach: loop on the line numbers of those lines matching "Device:"
## to get: A. counts; B. device names
ii_block_dev <- grep("Device:", txt, fixed=TRUE)
ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
nblocks <- length(ii_block_dev)
## A. get counts for each device
## for each block, select *last* line matching "Total I/Os"
ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
## select matching liens to "Total I/Os" within each block
IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block ] &
ii_lines_IOs < ii_block_dev_aux[block+1]
]
tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
})
lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
Filter(function(v) v!="", strsplit(strng," ")[[1]])
})
IOs_counts <- sapply(lines_IOs, function(v) v[5])
## B. get device names:
## assumed to be on lines following each "Device:" match
ii_devices <- 1 + ii_block_dev
device_names <- sapply(ii_devices, function(ii){
Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
})
## Create a data.frame with "device_names" as column names and "IOs_counts" as
## the values of a single row.
## Sorting the device names by order() will help produce the same column names
## if different sysmon files list the devices in different order
ord <- order(device_names)
devices <- as.data.frame(structure(as.list(IOs_counts[ord]), names=device_names[ord]),
check.names=FALSE) ## Prevent R from messing with our device names
data.frame(stringsAsFactors=FALSE, check.names=FALSE,
Date=date_current_run, Cpu=Cpu, Count=count, devices)
}
print(extractv2("file2.txt"))
## WATCH OUT:
## merging will ONLY work if all devices have the same names across sysmon files!!
file.list <- rep("file2.txt",3)
merged <- do.call("rbind", lapply(file.list, extractv2))
print(merged)
%%%%%%% Shot 3: extract two tables, one with a single row, and a second with a variable number of rows (depending on the which devices are listed in each sysmon file).
extractv3 <- function(filenam="file2.txt"){
txt <- readLines(filenam)
## date of current run:
## assumed to be on 2nd line following the first line matching "current_run"
ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
date_current_run <- paste(line_current_run[5:8], collapse=" ")
## Cpu:
## assumed to be on line following the first line matching "Server Summary"
## which comes after the first line matching "Engine Utilization ..."
jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
ii <- grep("Server Summary",txt, fixed=TRUE)
ii <- 1 + min(ii[ii>jj])
line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
Cpu <- line_Cpu[2]
## Count:
## assumed to be on 2nd line following the first line matching "Transaction Summary"
ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
count <- line_count[5]
## first part of output: fixed three-column structure
fixed <- data.frame(stringsAsFactors=FALSE,
Date=date_current_run, Cpu=Cpu, Count=count)
## Total I/Os
## 1. Each line "Device:" is assumed to be the header of a block of lines
## containing info about a single device (there are 4 such blocks
## in your example);
## 2. each block is assumed to contain one or more lines matching
## "Total I/Os";
## 3. the relevant count data is assumed to be contained in the last
## of such lines (at column 4), for each block.
## Approach: loop on the line numbers of those lines matching "Device:"
## to get: A. counts; B. device names
ii_block_dev <- grep("Device:", txt, fixed=TRUE)
if(length(ii_block_dev)==0){
variable <- data.frame(stringsAsFactors=FALSE,
date_current_run=date_current_run,
device_names=NA, IOs_counts=NA)
}else{
ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
nblocks <- length(ii_block_dev)
if(length(ii_block_dev)==0){
sprintf("WEIRD datapoint at date %s: I have %d devices but 0 I/O lines??")
##stop()
}
## A. get counts for each device
## for each block, select *last* line matching "Total I/Os"
ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
## select matching lines to "Total I/Os" within each block
IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block ] &
ii_lines_IOs < ii_block_dev_aux[block+1]
]
tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
})
lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
Filter(function(v) v!="", strsplit(strng," ")[[1]])
})
IOs_counts <- sapply(lines_IOs, function(v) v[5])
## B. get device names:
## assumed to be on lines following each "Device:" match
ii_devices <- 1 + ii_block_dev
device_names <- sapply(ii_devices, function(ii){
Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
})
## Create a data.frame with three columns: date, device, counts
variable <- data.frame(stringsAsFactors=FALSE,
date_current_run=rep(date_current_run, length(IOs_counts)),
device_names=device_names, IOs_counts=IOs_counts)
}
list(fixed=fixed, variable=variable)
}
print(extractv3("file2.txt"))
file.list <- c("file.txt","file2.txt","file3.txt")
res <- lapply(file.list, extractv3)
fixed.merged <- do.call("rbind", lapply(res, function(r) r$fixed))
print(fixed.merged)
variable.merged <- do.call("rbind", lapply(res, function(r) r$variable))
print(variable.merged)
Manipulating text files can sometimes be easier using dedicated programs. E.g. gawk
is specifically designed for finding patterns in text files and outputting data from them. We can use a short gawk script to get the required data to load into R. Note, each line of the script consists of a pattern to look for, followed by an action to take enclosed in{}
. NR
is a counter that counts number of lines read so far.
BEGIN {OFS = ""; ORS = ""}
/current_run/ {dat_line = NR+2; cpu_done = 0}
/Server Summary/ {cpu_line = NR+1}
/Transaction Summary/ {cnt_line = NR+2}
NR == dat_line {print "'",$5," ",$6," ",$7," ",$8,"' "}
NR == cpu_line && cpu_done==0 {print $2," "; cpu_done = 1}
NR == cnt_line {print $5,"\n"}
Save this script with the name "ext.awk", then extract all the data files into an R data frame (assuming they are all located in one folder and have the extension .txt
) with
df <- read.table(text=system("gawk -f ext.awk *.txt", T), col.names = c("Date","Cpu","Count"))
NB, gawk comes ready installed on most Linux versions. On windows you may need to install it from http://gnuwin32.sourceforge.net/packages/gawk.htm
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With