I have a longitudinal data set generated by a computer simulation that can be represented by the following tables ('var' are variables):
time subject var1 var2 var3
t1 subjectA ...
t2 subjectB ...
and
subject name
subjectA nameA
subjectB nameB
However, the simulation writes its output to a data file in a format similar to the following:
time t1
description
subjectA nameA
var1 var2 var3
subjectB nameB
var1 var2 var3
time t2
description
subjectA nameA
var1 var2 var3
subjectB nameB
var1 var2 var3
...(and so on)
I have been using a (python) script to process this output data into a flat text file so that I can import it into R, python, SQL, or awk/grep it to extract information - an example of the type of information desired from a single query (in SQL notation, after the data is converted to a table) is shown below:
SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB'
I wonder if there is a more efficient solution: each of these data files can be ~100MB (and I have hundreds of them), so creating the flat text file is time-consuming and takes up additional hard drive space with redundant information. Ideally, I would interact with the original data set directly to extract the information I want, without creating the extra flat text file. Is there a simpler awk/perl solution for such tasks? I'm quite proficient at text processing in python, but my skills in awk are rudimentary and I have no working knowledge of perl; I wonder if these or other domain-specific tools can provide a better solution.
Thanks!
Postscript: Wow, thanks to all! I am sorry that I cannot choose everyone's answers.
@FM: thanks. My Python script resembles your code without the filtering step. But your organization is clean.
@PP: I thought I was already proficient in grep but apparently not! This is very helpful... but I think grepping becomes difficult when mixing the 'time' into the output (which I failed to include as a possible extraction scenario in my example! That's my bad).
@ghostdog74: This is just fantastic... but modifying the line to get 'subjectA' was not straightforward... (though I'll be reading up more on awk in the meantime and hopefully I'll grok later).
@weismat: Well stated.
@S.Lott: This is extremely elegant and flexible - I was not asking for a python(ic) solution but this fits in cleanly with the parse, filter, and output framework suggested by PP, and is flexible enough to accommodate a number of different queries to extract different types of information from this hierarchical file.
Again, I am grateful to everyone - thanks so much.
This is what Python generators are all about.
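def read_as_flat(some_file, subject_names):
    """Yield one flat row [time, description, var1, var2, var3] per subject block."""
    line_iter = iter(some_file)
    time_header = None
    for line in line_iter:
        words = line.split()
        if not words:
            continue  # skip blank lines
        if words[0] == 'time':
            # the "time" line is followed by its description line
            description = next(line_iter).strip()
            time_header = [words[1], description]
        elif words[0] in subject_names:
            # the variables are on the line after the subject line
            data = next(line_iter).split()
            yield time_header + data
You can use this like a standard Python iterator:
for time, description, var1, var2, var3 in read_as_flat(some_file, {'subjectB'}):
    ...  # etc.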
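To make the query from the question concrete, here is a minimal, self-contained sketch of the same generator idea that also keeps the time stamp next to each row. The names `iter_records` and `select` and the sample values are hypothetical, and the sketch assumes that every non-time line is a "subject name" line followed by exactly one line of values.

```python
import io

# Made-up sample in the layout described in the question.
SAMPLE = """\
time t1
description of t1
subjectA nameA
1 2 3
subjectB nameB
4 5 6
time t2
description of t2
subjectA nameA
7 8 9
subjectB nameB
10 11 12
"""

def iter_records(lines):
    """Yield (time, subject, name, values) for every subject block."""
    it = iter(lines)
    time = None
    for line in it:
        words = line.split()
        if not words:
            continue  # ignore blank lines
        if words[0] == "time":
            time = words[1]
            next(it, None)  # skip the description line
        else:
            # Assumption: this is a "subject name" line and the values follow it.
            values = next(it, "").split()
            yield time, words[0], words[1], values

def select(lines, subject):
    """Rough analogue of: SELECT time, var1, var2, var3 ... WHERE subject = ?"""
    for time, subj, _name, values in iter_records(lines):
        if subj == subject:
            yield (time, *values)

for row in select(io.StringIO(SAMPLE), "subjectB"):
    print(*row)
```

Because both functions are generators, the file is streamed line by line and no intermediate flat file is written; passing an open file object instead of `io.StringIO` should work the same way.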
If all you want is var1, var2, var3 upon matching a particular subject then you could try the following command:
grep -A 1 'subjectB'
The -A 1 command line argument instructs grep to print out the matched line and one line after the matched line (and in this case the variables come on a line after the subject).
You might want to anchor the subject search to the beginning of the line (e.g. grep -A 1 -E '^subjectB'). The -E option switches grep to extended regular expressions; grep matches basic regular expressions by default, and the ^ anchor works in both.
Finally the output will now consist of the subject line and variable line you want (GNU grep also prints a -- separator line between non-adjacent groups of matches). You may want to hide the subject line:
grep -A 1 'subjectB' |grep -v 'subjectB'
And you may wish to process the variable line:
grep -A 1 'subjectB' |grep -v 'subjectB' |perl -pe 's/ /,/g'
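For readers more comfortable in Python, the pipeline above can be approximated as follows. This is only a sketch: `lines_after_match` is a hypothetical helper that roughly mirrors `grep -A 1` followed by `grep -v` (it hides the matching line and keeps the line after it), and the sample data is invented.

```python
import re

def lines_after_match(lines, pattern, after=1):
    """Yield the `after` lines that follow each line matching `pattern`."""
    regex = re.compile(pattern)
    remaining = 0
    for line in lines:
        if regex.search(line):
            # The matching line itself is not emitted (the `grep -v` step).
            remaining = after
        elif remaining > 0:
            remaining -= 1
            yield line.rstrip("\n")

# Made-up sample in the layout described in the question.
sample = (
    "time t1\ndescription\nsubjectA nameA\n1 2 3\nsubjectB nameB\n4 5 6\n"
    "time t2\ndescription\nsubjectA nameA\n7 8 9\nsubjectB nameB\n10 11 12\n"
).splitlines()

# Same effect as the perl step: replace spaces with commas.
csv_rows = [row.replace(" ", ",") for row in lines_after_match(sample, r"^subjectB")]
print(csv_rows)
```

Unlike grep, this version never emits a separator line between groups, so no extra filtering step is needed.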