I need to process a big data file that contains multi-line records. Example input:
1 Name Dan
1 Title Professor
1 Address aaa street
1 City xxx city
1 State yyy
1 Phone 123-456-7890
2 Name Luke
2 Title Professor
2 Address bbb street
2 City xxx city
3 Name Tom
3 Title Associate Professor
3 Like Golf
4 Name
4 Title Trainer
4 Likes Running
Note that the first integer field is unique and really identifies a whole record. So in the above input I really have 4 records, although I don't know how many lines of attributes each record may have. I need to:
- identify valid records (a record must have "Name" and "Title" fields)
- output the needed attributes for each valid record; say "Name", "Title", and "Address" are the needed fields
Example output:
1 Name Dan
1 Title Professor
1 Address aaa street
2 Name Luke
2 Title Professor
2 Address bbb street
3 Name Tom
3 Title Associate Professor
So in the output file, record 4 is removed since it doesn't have the "Name" field. Record 3 doesn't have an Address field but is still printed to the output since it is a valid record that has "Name" and "Title".
Can I do this with awk? But how do I identify a whole record using the first "id" field on each line?
Thanks a lot to the unix shell script expert for helping me out! :)
This seems to work. There are MANY ways you could do this, even in awk.
I've spaced it out for easier reading.
Note that record 3 doesn't show up in this version, because it's missing an "Address" field and the script treats every listed field as required; a variant that reproduces your example output exactly is sketched after the script.
#!/usr/bin/awk -f

BEGIN {
    # Set your required fields here...
    required["Name"]=1;
    required["Title"]=1;
    required["Address"]=1;

    # Count the required fields
    for (i in required) enough++;
}

# Note that this will run on the first record, but only to initialize variables
$1 != last1 {
    if (hits >= enough) {
        printf("%s", output);
    }
    last1=$1; output=""; hits=0;
}

# This appends the current line to a buffer, followed by the record separator (RS)
{ output=output $0 RS }

# Count the required fields; used to determine whether to print the buffer
required[$2] { hits++ }

END {
    # Print the final buffer, since we only print on the next record
    if (hits >= enough) {
        printf("%s", output);
    }
}
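You'd run it by saving the script to a file, say filter.awk (the name is just an example), making it executable, and doing ./filter.awk input.txt > output.txt, or equivalently awk -f filter.awk input.txt.

If you want exactly the output shown in the question (record 3 kept, and only the Name/Title/Address lines printed), here is one possible variant, a sketch in the same spirit rather than a tested drop-in: it requires only Name and Title, buffers only the fields you asked for, and ignores lines that have no value after the field name (which is what disqualifies record 4):

#!/usr/bin/awk -f
BEGIN {
    # Only Name and Title are required for a record to count as valid
    required["Name"]=1;
    required["Title"]=1;
    for (i in required) enough++;

    # Fields that should appear in the output when present
    wanted["Name"]=1;
    wanted["Title"]=1;
    wanted["Address"]=1;
}
$1 != last1 {
    if (hits >= enough) printf("%s", output);
    last1=$1; output=""; hits=0;
}
# Buffer only wanted fields, and only when a value follows the field name
wanted[$2] && NF >= 3 { output=output $0 RS }
# Count required fields the same way
required[$2] && NF >= 3 { hits++ }
END {
    if (hits >= enough) printf("%s", output);
}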
I am not good at awk, but I'd solve this in Perl. Here is a Perl solution: for each record, it remembers the important lines and whether the Name and Title were seen. When the id changes, the previous record is printed if all the conditions are met, and the last record is flushed after the loop.
#!/usr/bin/perl
use warnings;
use strict;

my ($last, $has_name, $has_title, @record);
while (<DATA>) {
    my ($id, $key, $value) = split;
    # A new id means the previous record is complete: print it if valid
    if (@record and $id != $last) {
        print @record if $has_name and $has_title;
        undef @record;
        undef $has_name;
        undef $has_title;
    }
    # A field only counts if it actually has a value ("4 Name" has none)
    if (defined $value and length $value) {
        $has_name  = 1 if $key eq 'Name';
        $has_title = 1 if $key eq 'Title';
        push @record, $_ if grep $key eq $_, qw/Name Address Title/;
    }
    $last = $id;
}
# Flush the last record, which never sees a change of id
print @record if $has_name and $has_title;
__DATA__
1 Name Dan
1 Title Professor
1 Address aaa street
1 City xxx city
1 State yyy
1 Phone 123-456-7890
2 Name Luke
2 Title Professor
2 Address bbb street
2 City xxx city
3 Name Tom
3 Title Associate Professor
3 Like Golf
4 Name
4 Title Trainer
4 Likes Running
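To run this on a real data file instead of the embedded test data, one straightforward change (untested here, but standard Perl) is to drop the __DATA__ section and read from the diamond operator, i.e. while (<>) { ... }, then invoke it as perl filter.pl input.txt > output.txt (the script name is just a placeholder).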