Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

processing text from a non-flat file (to extract information as if it *were* a flat file)

I have a longitudinal data set generated by a computer simulation that can be represented by the following tables ('var' are variables):

time subject var1 var2 var3
t1   subjectA  ...
t2   subjectB  ...

and

subject   name
subjectA  nameA
subjectB  nameB

However, the file generated writes a data file in a format similar to the following:

time t1 
  description
subjectA nameA
  var1 var2 var3
subjectB nameB
  var1 var2 var3
time t2
  description
subjectA nameA
  var1 var2 var3
subjectB nameB
  var1 var2 var3
...(and so on)

I have been using a (python) script to process this output data into a flat text file so that I can import it into R, python, SQL, or awk/grep it to extract information - an example of the type of information desired from a single query (in SQL notation, after the data is converted to a table) is shown below:

SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB'

I wonder if there is a more efficient solution as each of these data files can be ~100MB each (and I have hundreds of them) and creating the flat text file is time-consuming and takes up additional hard drive space with redundant information. Ideally, I would interact with the original data set directly to extract the information that I desire, without creating the extra flat text file... Is there an awk/perl solution for such tasks that is simpler? I'm quite proficient at text-processing in python but my skills in awk are rudimentary and I have no working knowledge of perl; I wonder if these or other domain-specific tools can provide a better solution.

Thanks!

Postscript: Wow, thanks to all! I am sorry that I cannot choose everyone's answers @FM: thanks. My Python script resembles your code without the filtering step. But your organization is clean. @PP: I thought I was already proficient in grep but apparently not! This is very helpful... but I think grepping becomes difficult when mixing the 'time' into the output (which I failed to include as a possible extraction scenario in my example! That's my bad). @ghostdog74: This is just fantastic... but modifying the line to get 'subjectA' was not straightforward... (though I'll be reading up more on awk in the meantime and hopefully I'll grok later). @weismat: Well stated. @S.Lott: This is extremely elegant and flexible - I was not asking for a python(ic) solution but this fits in cleanly with the parse, filter, and output framework suggested by PP, and is flexible enough to accommodate a number of different queries to extract different types of information from this hierarchical file.

Again, I am grateful to everyone - thanks so much.

like image 481
hatmatrix Avatar asked Dec 29 '22 02:12

hatmatrix


2 Answers

This is what Python generators are all about.

def read_as_flat( someFile ):
    line_iter= iter(someFile)
    time_header= None
    for line in line_iter:
        words = line.split()
        if words[0] == 'time':
            time_header = [ words[1:] ] # the "time" line
            description= line_iter.next()
            time_header.append( description )
        elif words[0] in subjectNameSet:
            data = line_iter.next()
            yield time_header + data

You can use this like a standard Python iterator

for time, description, var1, var2, var3 in read_as_flat( someFile ):
    etc.
like image 88
S.Lott Avatar answered Dec 31 '22 14:12

S.Lott


If all you want is var1, var2, var3 upon matching a particular subject then you could try the following command:


  grep -A 1 'subjectB'

The -A 1 command line argument instructs grep to print out the matched line and one line after the matched line (and in this case the variables come on a line after the subject).

You might want to use the -E option to make grep search for a regular expression and anchor the subject search to the beginning-of-line (e.g. grep -A 1 -E '^subjectB').

Finally the output will now consist of the subject line and variable line you want. You may want to hide the subject line:


  grep -A 1 'subjectB' |grep -v 'subjectB'

And you may wish to process the variable line:


  grep -A 1 'subjectB' |grep -v 'subjectB' |perl -pe 's/ /,/g'
like image 28
PP. Avatar answered Dec 31 '22 14:12

PP.