I have a huge text file. I need to replace all occurrences of this three-line pattern:
|pattern|some data|
|gibberish|,,
|pattern|some other data|
by the last line of the pattern:
|pattern|some other data|
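In other words, remove the first two lines of the pattern and keep only the last one.
The second line of the pattern ends with two commas and does not start with |pattern|.
The first line of the pattern starts with |pattern| and does not end with two commas.
The third line of the pattern starts with |pattern| and does not end with two commas.
I tried this:
sed 'N;N;/^|pattern|.*\n.*,,\n|pattern|.*/I,+1 d' trial.txt
without much luck.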
Edit: Here is a more substantial example
#!/usr/bin/env bash
cat > trial.txt <<EOL
|pattern|sdkssd|
|.x,mz|e,dsa|,,
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
EOL
and it should become:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
@zdim:
the first three lines of the file:
|pattern|sdkssd|
|.x,mz|e,dsa|,,
|pattern|sdk;sd|
satisfy the pattern. So they are replaced by
|pattern|sdk;sd|
so the top of the file now becomes:
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
|l'kk|3lke|,,
...
the first three lines of which are:
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
which satisfy the pattern, so they are replaced by:
|pattern|aslkaa|
so the top of the file now is:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
....
@JosephQuinsey:
consider this file:
#!/usr/bin/env bash
cat > trial.txt <<EOL
|pattern|blabla|
|||4|||-0.97|0|1429037262.8271||20160229||1025||1000.0|0.01|,,
|pattern|blable|
|||5|||-1.27|0|1429037262.854||20160229||1025||1000.0|0.01|,,
|pattern|blasbla|
|||493|||-0.22|5|1429037262.8676||20170228||1025||1000.0|0.01|,,
|||11|||-0.22|5|1429037262.8676||20170228||1025||1000.0|0.01|,|T|347||1429043438.1962|-0.22|5|0||-0.22|1429043438.1962|,|Q|346||1429043437.713|-0.24|26|-0.22|5|||1429043437.713|
|pattern|jksds|
|||232|||-5.66|0|1429037262.817||20150415||1025||1000.0|0.01|,,
|pattern|bdjkds|
|||123q|||-7.15|0|1429037262.8271||20150415||1025||1000.0|0.01|,,
|pattern|blabla|
|||239ps|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,,
|||-92opa|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|1||1428969600.5019|-0.99|1|11||||,
|||kj2w|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|2||1428969600.5019|-1|1|11||||,
|||0293|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|3||1428969600.5019|-1.01|1|11||||,
|||2;;w32|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|4||1428969600.5019|-1.11|1|11||||,
EOL
Here is a simple take on it, using a buffer to collect and manage the pattern-lines
use warnings;
use strict;
use feature 'say';

my $file = shift or die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";

my @buf;
while (<$fh>) {
    chomp;
    if (/^\|pattern\|/ and not /,,$/) {
        @buf = $_;   # start the buffer (first line) or overwrite (third)
    }
    elsif (/,,$/ and not /^\|pattern\|/) {
        if (@buf) { push @buf, $_ }   # add to buffer with first line in it
        else      { say }             # not part of 3-line-pattern; print
    }
    else {
        say for @buf;   # time to print out buffer
        @buf = ();      # ... empty it ...
        say             # and print the current line
    }
}
say for @buf;   # print whatever is still buffered when the file ends
This prints the expected output.
Explanation.
Pattern lines go in a buffer, and when we get the "third line" the first two need to be removed. So we "assign" to the array whenever we see ^|pattern|: either to start the buffer if it's the first line, or to re-initialize the array (removing what's in it) if it's the third line.
A line ending with ,, is added to the buffer if there is a line there already. Nothing prohibits lines ending with ,, from appearing on their own, outside of a pattern; in that case just print the line.
So each |pattern| line sets the buffer straight: it either starts it or resets it. Thus, once we run into a line with neither ^|pattern| nor ,,$ we can print out our buffer, and then that line.
Please test more comprehensively, which I still haven't gotten to do.
In order to run this either in a pipeline or on a file, use the "magical" <> filehandle. The script then becomes
use warnings;
use strict;
use feature 'say';
my @buf;
while (<>) { # reads lines from files given on command line, or from STDIN
...
}
Now you can run it either as data | script.pl or as script.pl datafile. (Make the script executable for this, or run it as perl script.pl.)
The script's output goes to STDOUT, which can be piped into other programs or redirected to a file.
This depends on just how huge your file is, but if it fits in memory, how about:
perl -0777 -pe '
1 while s/^\|pattern\|.+?\|\n(?!\|pattern\|).+?,,\n(\|pattern\|.+?\|)$/$1/m;
' trial.txt
Output:
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|