How to remove the second line of consecutive lines starting with the same word?

Question

I have a text file with alternating lines starting with 'TITLE' and 'DATA', but sometimes there are two consecutive lines that both start with 'TITLE':

TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE some more
TITLE extra info
DATA some more data

I'd like to be able to detect the duplicate lines starting with 'TITLE' and keep only the first line of each such pair.
I figured out that the regular expression for capturing these is ^TITLE.* ^TITLE.* and now I'd like to incorporate this into a one-liner perl/bash/sed/awk command that removes the second line and outputs the rest of the file, but I couldn't figure out how.

ghoti · Accepted Answer

It sounds to me like you have records that consist of two fields, TITLE and DATA, and that if you're missing the second field, you want to drop the record. But that's not what you asked in your question. So here's one way to do what you asked:

awk '/^TITLE/&&!t{t=$0} /^DATA/&&t{print t;print;t=""}' inputfile

The idea here is that we set a variable to a TITLE line when we see one and don't already have a title set, then print it only when we see a DATA line. This works for the input data you provided, if I'm reading your question right. Output is:

TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE some more
DATA some more data

As you can see, the second of the two consecutive TITLE lines in your dataset (TITLE extra info) was dropped.
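If the one-liner is hard to read, here is the same program spread over several lines with comments, as a rough sketch. The wrapper name keep_first_title and the extra 'TITLE orphan' line are made up for illustration. Because a TITLE is printed only once a DATA line arrives, a trailing TITLE with no DATA after it is dropped too, which matches the "drop incomplete records" reading mentioned at the top of this answer.

```shell
# Hypothetical wrapper name; the awk program is the one-liner above, expanded.
keep_first_title() {
  awk '
    # Remember a TITLE line only if no TITLE is already pending.
    /^TITLE/ && !t { t = $0 }
    # On a DATA line with a pending TITLE, print both and clear the pending TITLE.
    /^DATA/ && t   { print t; print; t = "" }
  ' "$@"
}

# Sample run: the repeated TITLE and the trailing TITLE without DATA are both dropped.
printf '%s\n' 'TITLE some more' 'TITLE extra info' 'DATA some more data' 'TITLE orphan' |
  keep_first_title
```

With a file, the call would be keep_first_title inputfile.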

And here's another way to do this in awk...

awk '/^TITLE/&&t{next} t=0; /^TITLE/{t=1} 1' inputfile

In this one, the first expression skips titles if t has been set. The second expression unsets t; since the assignment t=0 evaluates to 0 (false), it prints nothing. The third expression sets it for titles, and the last expression (1) prints the line. Of course, the last three expressions don't get run if we skipped the line in the first expression. It generates the same output as above, and doesn't bother looking at /^DATA/.
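The same idea can be written with one rule per step. This is an equivalent restatement rather than the exact code above (the reset and the set of t are folded into a single assignment), and the function name drop_repeated_titles and the sample lines are invented for illustration. It suggests two things worth knowing: a run of three TITLE lines collapses to the first one, and a trailing TITLE with no DATA is kept, unlike in the first solution.

```shell
# Hypothetical wrapper name; a restatement of the second one-liner.
drop_repeated_titles() {
  awk '
    # A TITLE directly after another TITLE is skipped; t stays set,
    # so a third TITLE in a row is skipped as well.
    /^TITLE/ && t { next }
    # Every line that reaches this rule is kept. Record whether it is
    # a TITLE (1) or not (0) for the next line, then print it.
    { t = ($0 ~ /^TITLE/); print }
  ' "$@"
}

printf '%s\n' 'TITLE a' 'TITLE b' 'TITLE c' 'DATA d' 'TITLE e' | drop_repeated_titles
```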

Finally, this one is the least code but the oddest logic:

awk '/^DATA/ || !t; {t=/^TITLE/}' inputfile

It prints all data lines, or any line where t isn't set, then effectively sets t to a boolean, affecting the next line's evaluation. If you're doing this in csh or tcsh, beware of the exclamation point, which in those shells may need to be escaped.
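If escaping the exclamation point is a nuisance, one possible variant (not part of the answer above, just a sketch) compares t with 0 instead of negating it. An uninitialized awk variable compares equal to 0, so the first line of the file still prints. The wrapper name dedupe_no_bang is hypothetical.

```shell
# Hypothetical wrapper name; same logic as the one-liner above, without the negation operator.
dedupe_no_bang() {
  awk '
    # Print DATA lines, and any line whose predecessor was not a TITLE.
    /^DATA/ || t == 0
    # Record whether this line is a TITLE (1 or 0) for the next line.
    { t = ($0 ~ /^TITLE/) }
  ' "$@"
}

printf '%s\n' 'TITLE some more' 'TITLE extra info' 'DATA some more data' | dedupe_no_bang
```

As a one-liner this would be awk '/^DATA/ || t == 0; {t = /^TITLE/}' inputfile.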

choroba · Answer

Perl solution:

perl -ne 'print unless $t and /^TITLE/; $t = /^TITLE/'

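It remembers in the $t variable whether the previous line was a TITLE, and skips the current line if it is a TITLE and $t is true. With no file name on the command line, perl reads standard input.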

Tags: regex, bash, sed, awk, perl

Roey Angel
