
Read a large file and output sections matching multiple parameters

Tags: awk, perl

I rarely have to deal with scripting, so I'm up against a lack of knowledge for this problem.

I have a text file of more than 500 MB which is nicely sectioned, but I know there are 5 to 10 "bad" sections inside. The data within a section is easy for a human to evaluate, but I don't know how to do it in a program.

I pick up a known good value in #FIELD MyField; however, if that value does not appear in #FIELD LOCATION, something went wrong.

An example of two sections within the file looks like this. The first is 'bad' and the second is 'good'.

#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
  1. Sections start and end logically, with #START and #END.

  2. If #FIELD LOCATION does not exist, go to the next section.

  3. If #FIELD MyField="BAR" and #FIELD LOCATION does not contain BAR, print all lines from this section to a new file.

  4. Note - clarification of #FIELD MyField="BAR": this is a check value I put in by grabbing other information about the data as this file is being built (in my case it is a language indicator, such as EN or DE, so it would literally be #FIELD MyField="EN"). Any other value in this field should be ignored; such a record doesn't match my criteria.

I believe this can be done in Awk or Perl. I can do very simple one-liners, but this is beyond my skills.

asked Jan 23 '12 by winndm



2 Answers

You could do something like the code below. It's just a rough draft, but it will work with your sample data. Use the flip-flop operator to find the start and end of records, a hash to store the field values, and an array to store the record's lines.

I am simply checking whether the value is in the location string; you might wish to narrow the check further by making sure it is in the correct place, or has the correct case.

use strict;
use warnings;

my @record;
my %f;
while(<DATA>) {
    if (/^#START / .. /^#END */) {
        if (/^#START /) {
            @record = (); # reset
            %f = ();
        }
        push @record, $_;
        if (/^#END */) { # check and print
            if (exists $f{'LOCATION'} && exists $f{'MyField'}
                && $f{'LOCATION'} !~ /\Q$f{'MyField'}\E/) { # skip sections with no LOCATION
                print @record; 
            }
        } else {         # add fields to hash
            if (/^#FIELD (.+)/) {
                # use split with a limit of 2 fields
                my ($key, $val) = split /=/, $1, 2;
                next unless $val; # no empty values
                $val =~ s/^"|"$//g; # strip quotes
                $f{$key} = $val;
            }
        }
    }
}

__DATA__
#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
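To run this against the real 500 MB file instead of the __DATA__ sample, the same logic can read from the diamond operator so that only one section is ever held in memory. Below is a minimal sketch of that variation; the script name filter.pl and the file names are placeholders, not from the original post.

use strict;
use warnings;

my @record;
my %f;
while (<>) {                          # <> streams the file(s) named on the command line
    if (/^#START / .. /^#END/) {
        if (/^#START /) {             # new section: reset the buffers
            @record = ();
            %f = ();
        }
        push @record, $_;
        if (/^#END/) {                # end of section: decide whether to print it
            if (exists $f{'LOCATION'} && exists $f{'MyField'}
                && $f{'LOCATION'} !~ /\Q$f{'MyField'}\E/) {
                print @record;
            }
        } elsif (/^#FIELD (.+)/) {    # collect FIELD key/value pairs
            my ($key, $val) = split /=/, $1, 2;
            next unless defined $val && length $val;
            $val =~ s/^"|"$//g;       # strip surrounding quotes
            $f{$key} = $val;
        }
    }
}

Invoke it as something like: perl filter.pl bigfile.txt > bad_sections.txt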
answered Sep 28 '22 by TLP


One-liner:

perl -ne 'BEGIN { $/ = "#END\n" } print if /MyField="(.*?)"/ && !/Value=$1/' <file >newfile

Sets the input record separator to "#END\n" so perl reads the 'chunks' into $_ one at a time, then captures the value in MyField and prints the whole chunk if Value=$1 (that is, the captured value after 'Value=') is not present.

You may of course make the regexes more specific if needed.
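For instance, one way to tighten it (just a sketch; the exact anchors depend on how the real LOCATION values look) is to compare only against the Value= parameter of the LOCATION line and to skip chunks that have no LOCATION line at all:

perl -ne 'BEGIN { $/ = "#END\n" } print if /^#FIELD LOCATION=/m && /MyField="(.*?)"/ && !/&Value=\Q$1\E[&"]/' <file >newfile

This follows rules 2 and 3 from the question: a chunk without a LOCATION line is ignored, and a chunk is written out only when the MyField value does not appear after &Value= in the LOCATION string.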

answered Sep 28 '22 by Josh Y.