awk: umlaut characters (å, ä, ö) have a length of 2

Question

I'm using awk (Mac OS X) to print only lines that are n characters or longer.

If I try it on a text file (strings.txt) that looks like this:

four
foo
bar
föö
bår
fo
ba
fö
bå

And I run this awk script:

awk ' { if( length($0) >= 3 ) print $0 } ' <strings.txt

The output is:

four
foo
bar
föö
bår
fö
bå

(The last two lines should not have been printed.) It seems that each umlaut character (å, ä, ö...) is counted as two characters.

(The input file is saved in UTF-8 format.)

mklement0 · Accepted Answer

BSD awk (a.k.a. BWK awk), as preinstalled on macOS (still true as of macOS 10.13), is - sadly - NOT Unicode-aware.
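The doubled count follows from the encoding itself: in UTF-8, å, ä, and ö each occupy two bytes, so a tool that counts bytes sees "fö" as three units. A minimal sketch of that, assuming a POSIX shell and a UTF-8 encoded script; `byte_len` is a hypothetical helper name, not something from the answer:

```shell
# Hypothetical helper: print the number of bytes in its first argument.
# wc -c always counts bytes, regardless of locale; tr strips the padding
# that some wc implementations (e.g. on macOS) add around the number.
byte_len() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

byte_len 'fo'   # prints 2
byte_len 'fö'   # prints 3, because ö is two bytes in UTF-8
```

An awk that is not Unicode-aware reports the same byte counts from `length`, which is why "fö" and "bå" pass the `>= 3` test.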

Your choices are:

- If you know that the characters involved fit into a single-byte encoding such as ISO-8859-1, you can use iconv as follows:
```
iconv -f UTF-8 -t ISO-8859-1 file | awk 'length >= 3' | iconv -f ISO-8859-1 -t UTF-8
```
- Install a different awk implementation that is Unicode-aware, such as gawk (GNU Awk) or mawk; e.g., via Homebrew:
  - brew install gawk
  - brew install mawk
- Use a different preinstalled tool that is Unicode-aware, such as sed:
```
sed -n '/^.\{3,\}/p' file
```
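If none of these options is available, a byte-oriented awk can still approximate a character count. This is a different technique from the ones in this answer, sketched under the assumption that the input is valid UTF-8 and that the awk in use accepts octal escapes inside bracket expressions: every UTF-8 character contains exactly one byte that is not a continuation byte (0x80 to 0xBF), so removing continuation bytes leaves one byte per character. `min_chars` is a hypothetical helper name.

```shell
# Hypothetical helper: read lines on stdin and print those that contain
# at least $1 UTF-8 characters, using only byte-level awk features.
min_chars() {
    # LC_ALL=C forces byte semantics, so the result does not depend on
    # whether this awk is Unicode-aware.
    LC_ALL=C awk -v n="$1" '{
        s = $0
        # Remove UTF-8 continuation bytes (octal 200-277, hex 80-BF);
        # the bytes that remain correspond one-to-one to characters.
        gsub(/[\200-\277]/, "", s)
        if (length(s) >= n) print
    }'
}

printf 'four\nföö\nfo\nfö\nbå\n' | min_chars 3
```

With the sample input, only "four" and "föö" should be printed. The line itself is printed unmodified; only the copy in `s` is stripped.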

Dimitre Radoulov · Answer

Try setting your locale:

LC_ALL=en_US.UTF-8 awk 'length >= 3' infile

Replace en_US.UTF-8 with the UTF-8 locale that matches your system.
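Whether the locale setting has any effect depends on the awk implementation. One way to check is to ask awk for the length of a known string under a given locale. A small sketch, where `awk_len` is a hypothetical helper that takes a locale name and a string:

```shell
# Hypothetical helper: report what awk considers the length of $2
# when run with LC_ALL set to $1.
awk_len() {
    printf '%s\n' "$2" | LC_ALL="$1" awk '{ print length($0) }'
}

# Byte semantics: 'fö' is three bytes in UTF-8.
awk_len C 'fö'             # prints 3

# With a UTF-8 locale, a Unicode-aware awk should report 2 here;
# an awk that only counts bytes still reports 3.
awk_len en_US.UTF-8 'fö'
```

If the second call still reports 3, the locale is either not installed or the awk in use ignores it for `length`.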
