
Using multicharacter field separator using AWK

Tags:

awk

I'm having problems with AWK's field delimiter. The input file looks like this:

1 | all | | synonym |
1 | root | | scientific name |
2 | Bacteria | Bacteria | scientific name |
2 | Monera | Monera | in-part |
2 | Procaryotae | Procaryotae | in-part |
2 | Prokaryota | Prokaryota | in-part |
2 | Prokaryotae | Prokaryotae | in-part |
2 | bacteria | bacteria | blast name |

The field delimiter here is tab, pipe, tab (\t|\t), so I tried the following to print just the 1st and 2nd columns:

awk -F'\t|\t' '{print $1 "\t" $2}' nodes.dmp | less

Instead of the desired output, I get the 1st column followed by the pipe character. I tried escaping the pipe (\t\|\t), but the output remains the same.

1 |
1 |
2 |
2 |
2 |
2 |

Printing the 1st and 3rd columns gives me the output I originally wanted.

awk -F'\t|\t' '{print $1 "\t" $3}' nodes.dmp | less

But I'm puzzled as to why the first command is not working as intended.

I understand that the Perl one-liner below will work, but what I really want is to use awk.

perl -aln -F"\t\|\t" -e 'print $F[0],"\t",$F[1]' nodes.dmp | less

Buthetleon asked Mar 23 '23 19:03


1 Answer

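awk treats a multi-character field separator as a regular expression, and in a regular expression | means alternation. So \t|\t is read as "\t or \t", that is, a single tab. The line is then split on every tab, which is why $2 is the pipe and $3 is the column you wanted. Put the | in a bracket expression so that awk interprets it literally.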

$ awk -F'\t[|]\t' '{print $1 "\t" $2}' nodes.dmp
1   all
1   root
2   Bacteria
2   Monera
2   Procaryotae
2   Prokaryota
2   Prokaryotae
2   bacteria
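
To see the difference without the real file, here is a minimal sketch that builds two sample records in a shell variable (the `sample`, `wrong`, and `right` names are only illustrative, and the records are a stand-in for nodes.dmp) and runs both separators against them:

```shell
# Two sample records delimited by tab, pipe, tab (an assumed stand-in for nodes.dmp).
sample=$(printf '1\t|\tall\t|\t\t|\tsynonym\t|\n2\t|\tBacteria\t|\tBacteria\t|\tscientific name\t|\n')

# Unbracketed: the regex means "tab OR tab", so awk splits on every single tab
# and $2 is the pipe character.
wrong=$(printf '%s\n' "$sample" | awk -F'\t|\t' '{print $1 "\t" $2}')

# Bracketed: [|] matches a literal pipe, so tab-pipe-tab is one separator
# and $2 is the second column.
right=$(printf '%s\n' "$sample" | awk -F'\t[|]\t' '{print $1 "\t" $2}')

printf 'unbracketed:\n%s\n' "$wrong"
printf 'bracketed:\n%s\n' "$right"
```

Escaping the pipe with a doubled backslash, as in -F'\t\\|\t', should also work in most awk implementations. A single backslash is usually consumed when awk processes escape sequences in the separator string, which would explain why \t\|\t behaved the same as \t|\t.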

devnull answered Mar 29 '23 21:03