I have a text file containing dates in the format:
yyyymmdd
The raw data look like this:
20171115
20171115
20180903
...
20201231
There are more than 100k rows. I am trying to keep the 10k "newest" lines in one file and the 10k "oldest" lines in a separate file.
I guess this must be a two-step process:
sort the lines,
then extract the 10k rows at the top (the "newest", i.e. the most recent dates) and the 10k rows at the end of the file (the "oldest" dates).
How could I achieve this using awk?
I also tried with Perl, with no luck, so a Perl one-liner would be very welcome as well.
Edit: I would prefer a clean, clever solution that I can learn from, rather than an optimization of my attempts.
My attempt with Perl, which did not work:
@dates = ('20170401', '20170721', '20200911');
@ordered = sort { &compare } @dates;
sub compare {
    $a =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $b =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $c <=> $d;
}
print "@ordered\n";
This is an answer using Perl. If you want the oldest on top, you can use the standard sort order:
@dates = sort @dates;
Reverse sort order, with the newest on top:
@dates = sort { $b <=> $a } @dates;
#                  ^^^
#                   |
# numerical three-way comparison returning -1, 0 or +1
You can then extract 10000 of the entries from the top:
my $keep = 10000;
my @top = splice @dates, 0, $keep;
And 10000 from the bottom:
$keep = @dates unless(@dates >= $keep);
my @bottom = splice @dates, -$keep;
@dates will now contain the dates between the 10000 at the top and the 10000 at the bottom that you extracted.
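The sort and the two splice steps above can be put together in one small routine. This is only a sketch: the name split_newest_oldest and its return shape are made up for illustration, and the sample dates are invented. It also guards the case where nothing is left for the bottom part, because a splice offset of -0 would otherwise take the whole array.

```perl
use strict;
use warnings;

# Hypothetical helper (name and return shape are assumptions for this sketch).
# Takes the number of entries to keep and a list of yyyymmdd strings; returns
# three array refs: the newest $keep, the oldest $keep, and what is left between.
sub split_newest_oldest {
    my ($keep, @dates) = @_;

    @dates = sort { $b <=> $a } @dates;       # newest on top

    my @top = splice @dates, 0, $keep;        # removes up to $keep entries from the front

    $keep = @dates unless @dates >= $keep;    # fewer left than requested: take what remains
    my @bottom = $keep ? splice(@dates, -$keep) : ();   # -0 would remove everything

    return (\@top, \@bottom, \@dates);
}

# Small demonstration with invented dates and $keep = 2
my ($top, $bottom, $rest) = split_newest_oldest(
    2, qw(20171115 20180903 20201231 20190101 20171115 20200229)
);
print "top:    @$top\n";      # top:    20201231 20200229
print "bottom: @$bottom\n";   # bottom: 20171115 20171115
print "rest:   @$rest\n";     # rest:   20190101 20180903
```

All three lists stay in newest-first order, because the splices only cut pieces out of the already sorted array.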
You can then save the two arrays to files if you want:
sub save {
    my $filename=shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if(@_);
    close $fh;
}
save('top', @top);
save('bottom', @bottom);
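The answer does not show how @dates gets filled from the text file. One possible way, as a sketch only, is a reading counterpart to save. The name load and the decision to skip lines that are not eight digits are assumptions, not part of the answer above; the temporary file exists only so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical counterpart to save(): read one date per line into a list.
sub load {
    my $filename = shift;
    open my $fh, '<', $filename or die "$filename: $!";
    chomp(my @lines = <$fh>);            # read all lines and strip the newlines
    close $fh;
    return grep { /^\d{8}$/ } @lines;    # assumption: silently skip anything that is not yyyymmdd
}

# Demonstration on a throwaway file with invented content (one blank line included)
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print $tmp_fh "20171115\n20180903\n\n20201231\n";
close $tmp_fh;

my @dates = load($tmp_name);
print scalar(@dates), " dates read\n";   # 3 dates read
```

Stripping the newlines on input matches save, which adds them back with join when writing.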
A command-line script ("one"-liner) with Perl
perl -MPath::Tiny=path -we'
$f = shift; $n = shift//2; # filename; number of lines or default
@d = sort +(path($f)->lines); # sort lexicographically, ascending
$n = int @d/2 if 2*$n > @d; # top/bottom lines, up to half of file
path("bottom.txt")->spew(@d[0..$n-1]); # write files, top/bottom $n lines
path("top.txt") ->spew(@d[$#d-$n+1..$#d])
' dates.txt 4
Comments
Needs a filename, and can optionally take the number of lines to take from the top and bottom; in this example 4 is passed (the default is 2), for easy tests with small files. There is no need to check for the filename since the library used to read it, Path::Tiny, does that.
For the library (-MPath::Tiny) I specify the method name (=path) only for documentation; this isn't necessary since the library is a class, so =path may simply be removed.
Sorting is alphabetical, but that is fine with dates in this format; the oldest dates come first, but that doesn't matter since we split off what we need. To enforce numerical sorting and, while at it, to sort in descending order, use sort { $b <=> $a } @d; instead (the two output file names would then need to be swapped). See the documentation for sort.
We check whether there are enough lines in the file for the desired number of lines ($n) to shave off from the (sorted) top and bottom. If there aren't, $n is set to half the number of lines in the file.
The syntax $#ary is the last index of the array @ary, and it is used to count off $n items from the back of the array of lines, @d.
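To make the clamping of $n and the two array slices concrete, here is a small stand-alone sketch using the same expressions as the one-liner. The sample dates are invented, and the variable names @unsorted, @oldest and @newest are introduced only for this illustration.

```perl
use strict;
use warnings;

# Invented sample data; after a lexicographic sort the oldest date comes first
my @unsorted = qw(20200911 20170401 20170721 20180903 20201231);
my @d = sort @unsorted;

my $n = 4;                       # ask for more than half of the 5 lines
$n = int @d/2 if 2*$n > @d;      # clamped to int(5/2) = 2

my @oldest = @d[0 .. $n-1];            # first $n elements
my @newest = @d[$#d-$n+1 .. $#d];      # last $n elements; $#d is 4 here

print "n = $n\n";               # n = 2
print "oldest: @oldest\n";      # oldest: 20170401 20170721
print "newest: @newest\n";      # newest: 20200911 20201231
```

With an odd number of lines and a clamped $n, the middle line ends up in neither slice, which is consistent with taking at most half of the file from each end.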
This is written as a command-line program ("one-liner") merely because that was asked for. But that much code would be far more comfortable in a script.
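As a rough idea of what such a script could look like, here is a sketch that uses only core Perl in place of Path::Tiny. The script name split_dates.pl and the sub names split_file and write_lines are hypothetical; the output file names and the default of 2 are taken from the one-liner. Unlike the one-liner, it strips newlines on input and adds them back on output, so a missing newline at the end of the input cannot glue two dates together.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical script version of the one-liner, core Perl only.
# Usage: perl split_dates.pl dates.txt [N]
sub split_file {
    my ($file, $n) = @_;
    $n //= 2;                                  # same default as the one-liner

    open my $in, '<', $file or die "$file: $!";
    chomp(my @d = <$in>);                      # read all lines, strip newlines
    close $in;

    @d = sort @d;                              # lexicographic, ascending: oldest first
    $n = int @d/2 if 2*$n > @d;                # at most half of the file from each end

    write_lines('bottom.txt', @d[0 .. $n-1]);          # oldest $n lines
    write_lines('top.txt',    @d[$#d-$n+1 .. $#d]);    # newest $n lines
    return $n;                                 # number of lines actually written per file
}

sub write_lines {
    my ($name, @lines) = @_;
    open my $out, '>', $name or die "$name: $!";
    print $out map { "$_\n" } @lines;
    close $out;
}

# Run only when a filename was given on the command line
split_file(@ARGV) if @ARGV;
```

For the task in the question it would be called as perl split_dates.pl dates.txt 10000, which writes bottom.txt and top.txt into the current directory.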