
awk åäö umlaut-chars has length of 2

I'm using awk (on Mac OS X) to print only the lines that are at least n characters long.

If I try it on a text file (strings.txt) that looks like this:

four
foo
bar
föö
bår
fo
ba
fö
bå

And I run this awk script:

awk ' { if( length($0) >= 3 ) print $0 } ' <strings.txt 

The output is:

four
foo
bar
föö
bår
fö
bå

(The last two lines should not have been printed.) It seems that each umlaut character (å, ä, ö, ...) is counted as two characters.

(The input file is saved in UTF-8 format.)

Superpanic asked Sep 28 '11 04:09


2 Answers

BSD awk (a.k.a. BWK awk), as preinstalled on macOS (still true as of macOS 10.13), is, sadly, not Unicode-aware: its length function counts bytes rather than characters.
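
A quick way to see the byte counting at work, as a sketch: in UTF-8, ö is encoded as the two bytes 0xC3 0xB6, so fö is three bytes long, which is why it passes the >= 3 test in the question. The octal escapes below are only an encoding-independent way of writing fö, and the variable name bytes is illustrative.

```shell
# "fö" as UTF-8 bytes: f = 0x66, ö = 0xC3 0xB6 (written here as octal escapes)
# wc -c counts bytes; tr strips the padding that some wc implementations add
bytes=$(printf 'f\303\266' | wc -c | tr -d ' ')
echo "$bytes"   # prints 3: two characters, three bytes
```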

Your choices are:

  • If you know that all characters involved fit into a single-byte encoding such as ISO-8859-1, you can convert with iconv before and after the awk call:

    iconv -f UTF-8 -t ISO-8859-1 file | awk 'length >= 3' | iconv -f ISO-8859-1 -t UTF-8
    
  • Install a different awk implementation that is Unicode-aware, such as gawk (GNU Awk); e.g., via Homebrew:
    • brew install gawk
    Note that mawk is not a substitute here: like BSD awk, it counts bytes rather than characters.
  • Use a different preinstalled tool that is Unicode-aware, such as sed:

    sed -n '/^.\{3,\}/p' file
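
If none of these options is available, one further workaround (not one of the choices listed above, and only a sketch) is to count characters at the byte level in awk itself. It assumes the input is valid UTF-8: every character then contains exactly one byte that is not a continuation byte (0x80 to 0xBF), so deleting the continuation bytes leaves one byte per character. The helper name min_chars is hypothetical, and this has not been checked against every awk implementation.

```shell
# Hypothetical helper: print the stdin lines that have at least $1 UTF-8 characters.
# LC_ALL=C makes awk treat every byte as one character, whatever the implementation.
min_chars() {
  LC_ALL=C awk -v min="$1" '{
    s = $0
    gsub(/[\200-\277]/, "", s)     # drop UTF-8 continuation bytes (0x80-0xBF)
    if (length(s) >= min) print    # one byte is now left per character
  }'
}

# Demo with "fö" and "föö" written as octal escapes; only "föö" is printed.
printf 'f\303\266\nf\303\266\303\266\n' | min_chars 3
```

With the file from the question, the call would be min_chars 3 <strings.txt.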
    
mklement0 answered Nov 10 '22 21:11


Try setting your locale:

LC_ALL=en_US.UTF-8 awk 'length >= 3' infile

Replace en_US.UTF-8 with the UTF-8 locale that applies to your system.
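
Setting the locale only helps when the awk in use honors it for length, and when the named locale is actually installed. A small check along these lines, offered as a sketch, shows which unit your awk counts in. The function name awk_length_unit is hypothetical, and the test string is fö written with octal escapes.

```shell
# Hypothetical helper: report whether awk's length counts characters or bytes
# under the given locale (defaults to en_US.UTF-8).
awk_length_unit() {
  n=$(printf 'f\303\266\n' | LC_ALL="${1:-en_US.UTF-8}" awk '{ print length($0) }')
  case $n in
    2) echo characters ;;   # locale-aware: "fö" is two characters
    3) echo bytes ;;        # byte-oriented awk, or the locale is unavailable
    *) echo "unexpected: $n" ;;
  esac
}

awk_length_unit en_US.UTF-8
```

If this reports bytes, the locale setting alone will not fix the length test.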

Dimitre Radoulov answered Nov 10 '22 21:11