
Read a large file and output sections matching multiple parameters

Tags: awk, perl

I rarely have to deal with scripting, so I'm up against a lack of knowledge for this problem.

I have a text file of more than 500 MB which is nicely sectioned, but I know there are 5 to 10 "bad" sections inside. The data within a section is easy for a human to evaluate, but I don't know how to do it in a program.

I pick up a known good value in #FIELD MyField; however, if that value does not appear in #FIELD LOCATION, something went wrong.

An example of two sections within the file looks like this. The first is 'bad' and the second is 'good'.

#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
  1. Sections start and end logically, with #START and #END.

  2. If #FIELD LOCATION does not exist, go to the next section.

  3. If #FIELD MyField="BAR" and #FIELD LOCATION does not contain BAR, print all lines from this section to a new file.

  4. Note - clarification of #FIELD MyField="BAR": this is a check value I put in by grabbing other information about the data as this file is being built (in my case it is a language indicator, such as EN or DE, so it would literally be #FIELD MyField="EN"). Any other value in this field should be ignored; such a record doesn't match my criteria.

I believe this can be done in Awk or Perl. I can do very simple one-liners, but this is beyond my skills.

asked Jan 23 '12 by winndm



2 Answers

You could do something like the code below. It's just a rough draft, but it will work with your sample data. Use the flip-flop operator to find the start and end of records, a hash to store the field values, and an array to store the record's lines.

I am simply checking whether the value is in the location string; you might wish to narrow the check further by making sure it is in the correct place, or has the correct case.

use strict;
use warnings;

my @record;
my %f;
while(<DATA>) {
    if (/^#START / .. /^#END */) {
        if (/^#START /) {
            @record = (); # reset
            %f = ();
        }
        push @record, $_;
        if (/^#END */) { # check and print
            if (exists $f{'LOCATION'} && exists $f{'MyField'}
                && $f{'LOCATION'} !~ /\Q$f{'MyField'}\E/) { # skip sections with no LOCATION
                print @record; 
            }
        } else {         # add fields to hash
            if (/^#FIELD (.+)/) {
                # use split with a limit of 2 fields
                my ($key, $val) = split /=/, $1, 2;
                next unless $val; # no empty values
                $val =~ s/^"|"$//g; # strip quotes
                $f{$key} = $val;
            }
        }
    }
}

__DATA__
#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
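To run this against the real 500 MB file instead of the __DATA__ sample, the same logic can read from the diamond operator so that only one section is ever held in memory. Below is a minimal sketch of that variation; the script name filter.pl and the file names are placeholders, not from the original post.

use strict;
use warnings;

my @record;
my %f;
while (<>) {                          # <> streams the file(s) named on the command line
    if (/^#START / .. /^#END/) {
        if (/^#START /) {             # new section: reset the buffers
            @record = ();
            %f = ();
        }
        push @record, $_;
        if (/^#END/) {                # end of section: decide whether to print it
            if (exists $f{'LOCATION'} && exists $f{'MyField'}
                && $f{'LOCATION'} !~ /\Q$f{'MyField'}\E/) {
                print @record;
            }
        } elsif (/^#FIELD (.+)/) {    # collect FIELD key/value pairs
            my ($key, $val) = split /=/, $1, 2;
            next unless defined $val && length $val;
            $val =~ s/^"|"$//g;       # strip surrounding quotes
            $f{$key} = $val;
        }
    }
}

Invoke it as something like: perl filter.pl bigfile.txt > bad_sections.txt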
answered Sep 28 '22 by TLP


One-liner:

perl -ne 'BEGIN { $/ = "#END\n" } print if /MyField="(.*?)"/ && !/Value=$1/' <file >newfile

Sets the input record separator to "#END\n" so perl reads the 'chunks' into $_ one at a time, then captures the value in MyField and prints the whole chunk if Value=$1 (that is, the captured value after 'Value=') is not present.

You may of course make the regexes more specific if needed.
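For instance, one way to tighten it (just a sketch; the exact anchors depend on how the real LOCATION values look) is to compare only against the Value= parameter of the LOCATION line and to skip chunks that have no LOCATION line at all:

perl -ne 'BEGIN { $/ = "#END\n" } print if /^#FIELD LOCATION=/m && /MyField="(.*?)"/ && !/&Value=\Q$1\E[&"]/' <file >newfile

This follows rules 2 and 3 from the question: a chunk without a LOCATION line is ignored, and a chunk is written out only when the MyField value does not appear after &Value= in the LOCATION string.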

answered Sep 28 '22 by Josh Y.