Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combine lines with matching first field

For a few years, I often have a need to combine lines of (sorted) text with a matching first field, and I never found an elegant (i.e. one-liner unix command line) way to do it. What I want is similar to what's possible with the unix join command, but join expects 2 files, with each key appearing maximum once. I want to start with a single file, in which a key might appear multiple tiles.

I have both a ruby and perl script that do this, but there's no way to shorten my algorithm into a one-liner. After years of unix use, I'm still learning new tricks with comm, paste, uniq, etc, and I suspect there's a smart way to do this.

There are some related questions, like join all lines that have the same first column to the same line; Command line to match lines with matching first field (sed, awk, etc.); and Combine lines with matching keys -- but those solutions never really give a clean and reliable solution.

Here's sample input:

apple:A fruit
apple:Type of: pie
banana:tropical fruit
cherry:small burgundy fruit
cherry:1 for me to eat
cherry:bright red

Here's sample output:

apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red

Here's my ideal syntax:

merge --inputDelimiter=":" --outputDelimiter=";" --matchfield=1 infile.txt

The "matchfield" is really optional. It could always be the first field. Subsequent appearances of the delimiter should be treated like plain text.

I don't mind a perl, ruby, awk one-liner, if you can think of a short and elegant algorithm. This should be able to handle millions of lines of input. Any ideas?

like image 577
Michael Douma Avatar asked Oct 13 '17 16:10

Michael Douma


1 Answers

Using awk one liner

awk -F: -v ORS="" 'a!=$1{a=$1; $0=RS $0} a==$1{ sub($1":",";") } 1' file

Output:

apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red

setting ORS="" ; By default it is \n.
The reason why we have set ORS="" (Output Record Separator) is because we don't want awk to include newlines in the output at the end of each record. We want to handle it in our own way, through our own logic. We are actually including newlines at the start of every record which has the first field different from the previous one.

a!=$1 : When variable a (initially null) doesn't match with first field $1 which is for eg. applein first line, then set a=$1 and $0=RS $0 i.e $0 or simply whole record becomes "\n"$0 (basically adding newline at the beginning of record). a!=$1 will always satisfy when there is a different first field ($1) than the previous line's $1 and is thus a criteria to segregate our records based on first field.

a==$1: If it matches then it probably means you are iterating over a record belonging to the previous record set. In this case substitute first occurrence of $1: (Note the : ) for eg. apple: with ;. $1":" could also be written as $1FS where FS is :

If you have millions of line in your file then this approach would be fastest because it doesn't involve any pre-processing and also we are not using any other data structure say array for storing your keys or records.

like image 91
Rahul Verma Avatar answered Sep 30 '22 09:09

Rahul Verma