I rarely have to deal with scripting, so I'm up against a lack of knowledge for this problem.
I have a >500 MB text file, which is nicely sectioned, but I know there are 5 to 10 "bad" sections inside. The data within the sections can be evaluated pretty easily by a human, but I don't know how to do it in a program.
I pick up a known good value in #FIELD MyField; however, if that value does not appear in #FIELD LOCATION, something went wrong.
An example of two sections within the file looks like this. The first is 'bad' and the second is 'good'.
#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
Sections start and end logically, with #START and #END.
If #FIELD LOCATION does not exist, go to the next section.
If #FIELD MyField="BAR" and #FIELD LOCATION does not contain BAR, print all lines from this section to a new file.
Note - a clarification of #FIELD MyField="BAR" - this is a check value I put in by grabbing other info about the data as the file is being built (in my case it is a language indicator, such as EN or DE, so it would literally be #FIELD MyField="EN"). Any other value in this field would be ignored; such a record doesn't match my criteria.
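To make those rules concrete, the per-section check would look roughly like this in Perl (just a sketch of the logic above, assuming one value per field and the field names from my sample):

sub section_is_bad {
    my @lines = @_;    # all lines of one #START..#END section
    my ($location) = map { /^#FIELD LOCATION="?([^"]*)/ ? $1 : () } @lines;
    my ($myfield)  = map { /^#FIELD MyField="([^"]*)"/  ? $1 : () } @lines;
    return 0 unless defined $location;    # no LOCATION: go to next section
    return 0 unless defined $myfield;     # no check value: nothing to test
    return index($location, $myfield) < 0;    # bad if value absent from LOCATION
}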
I believe this can be done in Awk or Perl; I can manage very simple one-liners, but this is beyond my skills.
You could do something like below. It's just a rough draft, but it will work with your sample data. Use the flip-flop operator to find the start and end of records. Use a hash to store the field values, and an array to store the record.
I am simply checking whether the value is in the location string; you might wish to narrow the check further by making sure it is in the correct place, or has the correct case (see the variant after the script below).
use strict;
use warnings;

my @record;
my %f;

while (<DATA>) {
    if (/^#START / .. /^#END */) {
        if (/^#START /) {
            @record = ();    # reset the stored lines
            %f      = ();    # reset the field values
        }
        push @record, $_;
        if (/^#END */) {    # end of section: check and print
            # skip sections missing LOCATION or the check value,
            # print those whose LOCATION lacks the MyField value
            if (    defined $f{'LOCATION'}
                and defined $f{'MyField'}
                and $f{'LOCATION'} !~ /\Q$f{'MyField'}\E/ )
            {
                print @record;
            }
        }
        else {    # add fields to hash
            if (/^#FIELD (.+)/) {
                # use split with a limit of 2 fields
                my ($key, $val) = split /=/, $1, 2;
                next unless $val;      # no empty values
                $val =~ s/^"|"$//g;    # strip quotes
                $f{$key} = $val;
            }
        }
    }
}
__DATA__
#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
#START Descriptor
#FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
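To run the draft against the real file rather than the sample data, change while(<DATA>) to while(<>) and pass the file on the command line; the script and file names below are placeholders:

perl filter.pl bigfile.txt > bad_sections.txt

Because only one section is held in memory at a time, this stays cheap even on a >500 MB input. And to narrow the check as mentioned above - insisting the value sits immediately after 'Value=' in the URL, with any regex metacharacters escaped - the condition at #END could instead read (an untested variant):

    if (defined $f{'LOCATION'} and defined $f{'MyField'}
        and $f{'LOCATION'} !~ /&Value=\Q$f{'MyField'}\E(&|$)/) {
        print @record;
    }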
One-liner:
perl -ne 'BEGIN { $/ = "#END\n" } print if /MyField="(.*?)"/ and !/Value=$1/' <file >newfile
Sets the Input Record Separator to "#END\n" so perl reads the 'chunks' into $_ one at a time, then captures the value in MyField and prints the whole chunk only when that capture succeeded and Value=$1 (that is, 'Value=' followed by the captured value) is not present.
You may of course make the regexes more specific if needed.
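If the one-liner should also honour the 'no #FIELD LOCATION, skip the section' rule, a variant along these lines should work (untested sketch; the /m modifier lets ^ match at each line start inside a chunk):

perl -ne 'BEGIN { $/ = "#END\n" } next unless /^#FIELD LOCATION=/m; print if /MyField="(.*?)"/ and !/Value=$1/' <file >newfile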