
Case insensitive string matching in awk

Assume a multi-line text file named file in which some lines start with whitespace.

$ cat file
foo Baz
  baz QUX
    QUx Quux
BaZ Qux
BazaaR

Further assume that I wish to convert to lowercase all lines that start with a keyword (e.g. "baz"), irrespective of whether (a) the keyword itself is written in lowercase, uppercase, or any combination thereof, and (b) the keyword is preceded by whitespace.

$ cat file | sought_command
foo Baz        # not to lowercase (line does not start with keyword)
  baz qux      # to lowercase
    QUx Quux
baz qux        # to lowercase
BazaaR         # not to lowercase (line does not start with keyword, but merely with a word containing the keyword)

I believe that awk is the tool to do it, but I am uncertain how to implement the case-insensitivity for the keyword matching.

$ cat file | awk '{ if($1 ~ /^ *baz/) print tolower($0); else print $0}'
foo Baz
  baz qux
    QUx Quux
BaZ Qux       # ERROR HERE: was not replaced, b/c keyword not recognized.
BazaaR

EDIT 1: Adding IGNORECASE=1 appears to resolve the case-insensitivity, but now incorrectly converts the last line to lowercase.

$ cat file | awk '{IGNORECASE=1; if($1~/^ *baz/) print tolower($0); else print $0}'
foo Baz
  baz qux
    QUx Quux
baz qux
bazaar       # ERROR HERE: should not be converted to lowercase, as keyword not present (emphasis on word!).
Michael G asked Jul 05 '17 15:07

2 Answers

You already know about tolower(), so just use it again in the comparison and test for an exact string match instead of a partial regexp match:

awk 'tolower($1)=="baz"{$0=tolower($0)}1'
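Run against the sample input from the question, the one-liner can be checked as sketched below. The function name lower_keyword_lines is hypothetical, and the input is piped in with printf rather than read from file; neither is part of the original answer.

```shell
# Hypothetical wrapper around the one-liner above; the keyword is fixed to "baz".
lower_keyword_lines() {
    # $1 is the first whitespace-separated field, so leading blanks are ignored
    # and "BazaaR" (lowercased: "bazaar") does not compare equal to "baz".
    # Assigning to $0 replaces the record as-is, so leading whitespace is kept.
    awk 'tolower($1)=="baz"{$0=tolower($0)}1'
}

printf '%s\n' 'foo Baz' '  baz QUX' '    QUx Quux' 'BaZ Qux' 'BazaaR' | lower_keyword_lines
# foo Baz
#   baz qux
#     QUx Quux
# baz qux
# BazaaR
```

Because the test is a string comparison rather than a regexp match, this should work in any awk, not only gawk.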
Ed Morton answered Sep 24 '22 12:09


Add a word boundary after the search string:

$ awk '{IGNORECASE=1; if($1~/^ *baz\>/) print tolower($0); else print $0}' ip.txt 
foo Baz
  baz qux
    QUx Quux
baz qux
BazaaR

This can be rewritten as:

awk 'BEGIN{IGNORECASE=1} /^ *baz\>/{$0=tolower($0)} 1' ip.txt 

Since the pattern is anchored at the start of the line, there is no need to match against $1. The 1 at the end prints the record, including any changes made to it.

IGNORECASE and \> are gawk-specific features. \y can also be used to match a word boundary.
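For awk implementations that lack IGNORECASE and \>, a similar effect can be approximated by matching against a lowercased copy of the line and spelling out the boundary by hand. This is a minimal sketch, not part of the original answer: it assumes that a "word character" means a letter, digit, or underscore, and the function name lower_keyword_lines_portable is hypothetical.

```shell
# Hypothetical portable variant: no IGNORECASE, no \> or \y.
lower_keyword_lines_portable() {
    awk '{
        line = tolower($0)   # lowercased copy, used for matching and for output
        # "baz" after optional leading blanks, followed by a non-word
        # character or the end of the line (a hand-written word boundary).
        if (line ~ /^[ \t]*baz([^a-z0-9_]|$)/)
            print line
        else
            print $0
    }'
}

printf '%s\n' 'foo Baz' '  baz QUX' '    QUx Quux' 'BaZ Qux' 'BazaaR' | lower_keyword_lines_portable
```

Like the \> version, this treats a line such as "baz-foo" as starting with the keyword, whereas a comparison of $1 against "baz" would not.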


With GNU sed

$ sed 's/^[[:blank:]]*baz\b.*/\L&/I' ip.txt 
foo Baz
  baz qux
    QUx Quux
baz qux
BazaaR
  • [[:blank:]] matches space or tab characters
  • \L& lowercases the matched text, which here spans the whole line
  • \b is a word boundary
  • the I flag makes the match case-insensitive
Sundeep answered Sep 24 '22 12:09
