sed/awk/perl remove the first two lines of a 3 line pattern

Tags: sed, awk, ubuntu, perl

I have a huge text file. I need to replace all occurrences of this three-line pattern:

|pattern|some data|
|giberish|,,
|pattern|some other data|

with the last line of the pattern:

|pattern|some other data|

In other words, remove the first two lines of the pattern and keep only the last one.

  • The first line of the pattern starts with |pattern| and does not end with two commas.
  • The second line of the pattern ends with two commas and does not start with |pattern|.
  • The third line of the pattern starts with |pattern| and does not end with two commas.

I tried this:

sed 'N;N;/^|pattern|.*\n.*,,\n|pattern|.*/I,+1 d' trial.txt

without much luck.

Edit: Here is a more substantial example

#!/usr/bin/env bash
cat > trial.txt <<EOL
|pattern|sdkssd|
|.x,mz|e,dsa|,,
|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|
EOL

and it should become:

|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|

@zdim:

the first three lines of the file:

|pattern|sdkssd|
|.x,mz|e,dsa|,,
|pattern|sdk;sd|

satisfy the pattern. So they are replaced by

|pattern|sdk;sd|

so the top of the file now becomes:

|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|
|l'kk|3lke|,,
...

the first three lines of which are:

|pattern|sdk;sd|
|xl'x|cxm;s|,,
|pattern|aslkaa|

which satisfy the pattern, so they are replaced by:

|pattern|aslkaa|

so the top of the file now is:

|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
....

@JosephQuinsey:

consider this file:

#!/usr/bin/env bash
cat > trial.txt <<EOL
|pattern|blabla|
|||4|||-0.97|0|1429037262.8271||20160229||1025||1000.0|0.01|,,
|pattern|blable|
|||5|||-1.27|0|1429037262.854||20160229||1025||1000.0|0.01|,,
|pattern|blasbla|
|||493|||-0.22|5|1429037262.8676||20170228||1025||1000.0|0.01|,,
|||11|||-0.22|5|1429037262.8676||20170228||1025||1000.0|0.01|,|T|347||1429043438.1962|-0.22|5|0||-0.22|1429043438.1962|,|Q|346||1429043437.713|-0.24|26|-0.22|5|||1429043437.713|
|pattern|jksds|
|||232|||-5.66|0|1429037262.817||20150415||1025||1000.0|0.01|,,
|pattern|bdjkds|
|||123q|||-7.15|0|1429037262.8271||20150415||1025||1000.0|0.01|,,
|pattern|blabla|
|||239ps|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,,
|||-92opa|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|1||1428969600.5019|-0.99|1|11||||,
|||kj2w|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|2||1428969600.5019|-1|1|11||||,
|||0293|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|3||1428969600.5019|-1.01|1|11||||,
|||2;;w32|||-1.38|79086|1429037262.8773||20150415||1025||1000.0|0.01|,|T|4||1428969600.5019|-1.11|1|11||||,
EOL

asked Jan 26 '23 19:01 by user189035


2 Answers

Here is a simple take on it, using a buffer to collect and manage the pattern lines:

use warnings;
use strict;
use feature 'say';

my $file = shift or die "Usage: $0 file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my @buf;

while (<$fh>) { 
    chomp;
    if (/^\|pattern\|/ and not /,,$/) { 
        @buf = $_;     # start the buffer (first line) or overwrite (third)
    }   
    elsif (/,,$/ and not /^\|pattern\|/) { 
        if  (@buf) { push @buf, $_ }  # add to buffer with first line in it
        else       { say }            # not part of 3-line-pattern; print
    }   
    else {
        say for @buf;  # time to print out buffer
        @buf = ();     # ... empty it ...
        say;           # and print the current line
    }
}
say for @buf;          # flush lines still buffered when the file ends

This prints the expected output.

Explanation.

  • Pattern lines go into a buffer; when the "third line" arrives, the first two must be removed. So we assign to the array whenever a line matches ^|pattern|: this either starts the buffer (if it is the first line) or re-initializes it, discarding what it held (if it is the third line).

  • A line ending with ,, is added to the buffer if the buffer already holds a line. Nothing prevents lines ending with ,, from appearing outside of a pattern; in that case the line is simply printed.

  • So each |pattern| line sets the buffer straight: it either starts it or resets it. Thus, once we reach a line matching neither ^|pattern| nor ,,$, we can print the buffer and then that line.
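
Since the question also mentions awk, the buffer logic described above can be roughly sketched there as well. This is only a port under my own assumptions: the function name collapse is made up for illustration, any POSIX awk should do, and the END block (which prints whatever is still buffered when the input ends) is an addition of mine, not something the explanation above covers.

```shell
#!/usr/bin/env bash
# Sketch only: "collapse" is a hypothetical helper name.
# It reads files given as arguments, or standard input if there are none.
collapse() {
    awk '
    {
        is_pat = ($0 ~ /^\|pattern\|/)    # line starts with |pattern|
        is_cc  = ($0 ~ /,,$/)             # line ends with two commas
        if (is_pat && !is_cc) {
            n = 0; buf[++n] = $0          # start the buffer, or overwrite it
        } else if (is_cc && !is_pat) {
            if (n > 0) buf[++n] = $0      # continue a candidate pattern
            else print                    # stray ",," line: print as is
        } else {
            for (i = 1; i <= n; i++) print buf[i]   # print the buffer
            n = 0                                   # empty it
            print                                   # then the current line
        }
    }
    END { for (i = 1; i <= n; i++) print buf[i] }   # assumption: flush at end of input
    ' "$@"
}

# Tiny demonstration on made-up data
printf '%s\n' '|pattern|a|' '|x|,,' '|pattern|b|' '|y|,,' '|z|' | collapse
```

On that input the first two lines should disappear, leaving |pattern|b|, |y|,, and |z|.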

Please test this more comprehensively, which I haven't had a chance to do yet.


To run this either in a pipeline or on a file, use the "magical" <> filehandle. The script then becomes:

use warnings;
use strict;
use feature 'say';

my @buf;

while (<>) {  # reads lines from files given on command line, or from STDIN
    ...
}

Now you can run it either as data | script.pl or as script.pl datafile. (Make the script executable for this, or run it as perl script.pl.)

The script's output goes to STDOUT which can be piped into other programs or redirected to a file.

answered Jan 28 '23 08:01 by zdim


This depends on how large your file is, but if it fits in available memory, how about:

perl -0777 -pe '
    1 while s/^\|pattern\|.+?\|\n(?!\|pattern\|).+?,,\n(\|pattern\|.+?\|)$/$1/m;
' trial.txt

Output:

|pattern|aslkaa|
|l'kk|3lke|,,
|x;;lkaa|c,c,s|
|-0-ses|3dsd|
|xk;xzz|'l3ld|
|0=9c09s|klkl32|
|d0-zox|m,3,a|
|x'.za|wkl;3|
|=-0poxz|3kls|
|x-]0';a|sd;ks|
|wsd|756|
|sdw|;lksd|
|pattern|askjkas|
|xp]o]xa|lk3j2|,,
|]-p[z|lks|

answered Jan 28 '23 09:01 by tshiono