
Extracting Data from Text Files

Tags: r

There appear to be similar questions to this in other languages but I can't find one in R.

I have a number of text files in the subdirectories of a directory; they all have the extension (.log) and they contain a mixture of text and data. I want to extract a couple of lines from these relatively large files.

For example, one file goes as follows ...

blahblahblah

NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =  210

blahblahblah

 ----------------------------------------
 CPU timing information for all processes
 ========================================
 0: 8853.469 + 133.948 = 8987.417
 1: 8850.817 + 126.587 = 8977.405
 2: 8851.925 + 128.576 = 8980.501
 3: 8847.992 + 125.871 = 8973.864
 ----------------------------------------
 ddikick.x: exited gracefully.

blahblahblah

I want to harvest the number of basis functions (210 in this example) and the total CPU time, i.e. the sum of the values after the "=" sign on each timing line.

The line "NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =" appears only once in each file; i.e., if I open the file in a text editor and search for this string, I find exactly one line. The same is true of "CPU timing information for all processes" and "exited gracefully".

I appreciate that it appears that I haven't done a lot to help myself but I just don't know where to start. If someone could point me in the right direction, I hope to be able to fill in the rest.

With the help given to me by @Ben (see below), here is the code that I ended up using:

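library(stringr)

filesearch <- function(x) {
  f <- readLines(x)
  # Line holding the basis-function count; the count is the run of digits at its end
  cline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f, value = TRUE)
  val <- as.numeric(str_extract(cline, "[0-9]+$"))
  # Index of the CPU timing header; the four per-process lines start two lines below it
  coline <- grep("^ +CPU timing information", f)
  numstr <- sapply(str_extract_all(f[coline + 2:5], "[0-9.]+"), as.numeric)
  # Row 4 holds each process's total; sum them and divide by 60
  cline1 <- sum(numstr[4, ]) / 60
  output <- c(val, cline1)
  # Print the two values, and also return them so they can be stored
  cat(output, "\n")
  invisible(output)
}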

I sourced this function and called it on each file in turn, then copied the two results into another file by hand. It is not as elegant as I'd like, but it saved me a lot of time. Thanks again to @Ben.

DarrenRhodes asked Jan 10 '13 15:01



1 Answer

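Maybe something like this will work?

library(stringr)
f <- readLines("datafile.txt")
# Keep the matching line itself (value=TRUE) rather than its index
cline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS",f,
                    value=TRUE)
# The count is the run of digits at the end of that line
val <- as.numeric(str_extract(cline,"[0-9]+$"))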

To get the other values, try

cline <- grep("^ +CPU timing information",f)
(numstr <- sapply(str_extract_all(f[cline+2:5],"[0-9.]+"),as.numeric))
##         [,1]     [,2]     [,3]     [,4]
## [1,]    0.000    1.000    2.000    3.000
## [2,] 8853.469 8850.817 8851.925 8847.992
## [3,]  133.948  126.587  128.576  125.871
## [4,] 8987.417 8977.405 8980.501 8973.864

sapply() returns one column per input line, so the matrix is transposed relative to the file: the last row is the bit we want (it corresponds to the last column in the file). Extract it using numstr[4,] or numstr[nrow(numstr),] or tail(numstr,1).
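As a small self-contained check of that last step, here is a sketch that runs the same extraction on the four timing lines from the question. It swaps in base R's regmatches() and gregexpr() for str_extract_all() so that it needs no packages; the variable names timing, totals and total_cpu are only illustrative.

```r
# The four per-process timing lines from the example log.
timing <- c(" 0: 8853.469 + 133.948 = 8987.417",
            " 1: 8850.817 + 126.587 = 8977.405",
            " 2: 8851.925 + 128.576 = 8980.501",
            " 3: 8847.992 + 125.871 = 8973.864")

# Pull every number out of each line; sapply() gives one column per line.
numstr <- sapply(regmatches(timing, gregexpr("[0-9.]+", timing)), as.numeric)

# The last row holds the per-process totals (the value after "=").
totals <- numstr[nrow(numstr), ]

# Sum over all processes.
total_cpu <- sum(totals)
```

The question's own code divides this sum by 60. The log excerpt does not state the units, so that conversion is left out here.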

(edit: allow spaces before the "CPU timing" string) (edit: do it right!)

(To do this for all the log files, package it in a function and use list.files(pattern="\\.log$") in combination with sapply ...)
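A rough sketch of that suggestion follows, in base R only. The names extract_log and summarise_logs are made up for illustration. It assumes every log contains each marker string exactly once and is laid out like the example in the question: one rule line after the "CPU timing" header and one before "exited gracefully". Bounding the timing block by those two markers, rather than using the fixed 2:5 offsets, is a swapped-in variation so that the number of processes can differ between files.

```r
# Hypothetical helper: return the two values of interest for one log file.
extract_log <- function(path) {
  f <- readLines(path)

  # Basis-function count: the digits after the "=" on the matching line.
  bline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f, value = TRUE)
  nbasis <- as.numeric(sub(".*=\\s*([0-9]+)\\s*$", "\\1", bline))

  # Timing lines sit between the "=====" rule under the header
  # and the "-----" rule above "exited gracefully".
  first <- grep("CPU timing information", f) + 2
  last <- grep("exited gracefully", f) - 2
  timing <- f[first:last]

  # The last number on each timing line is that process's total.
  nums <- regmatches(timing, gregexpr("[0-9.]+", timing))
  totals <- vapply(nums, function(x) as.numeric(x[length(x)]), numeric(1))

  c(nbasis, sum(totals))
}

# Hypothetical wrapper: apply extract_log() to every .log file under dir,
# including subdirectories, and collect the results in a data frame.
summarise_logs <- function(dir = ".") {
  logs <- list.files(dir, pattern = "\\.log$", recursive = TRUE,
                     full.names = TRUE)
  res <- vapply(logs, extract_log, numeric(2))
  data.frame(file = logs,
             basis_functions = res[1, ],
             cpu_total = res[2, ],
             row.names = NULL)
}
```

Something like write.csv(summarise_logs("."), "log_summary.csv", row.names = FALSE) would then save the table in one go instead of copying values by hand; the output file name is just an example.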

Ben Bolker answered Oct 01 '22 13:10