I'm using awk (Mac OS X) to print only lines that are n characters or longer.
If I try it on a text file (strings.txt) that looks like this:
four
foo
bar
föö
bår
fo
ba
fö
bå
And I run this awk script:
awk ' { if( length($0) >= 3 ) print $0 } ' <strings.txt
The output is:
four
foo
bar
föö
bår
fö
bå
(The last two lines should not have been printed.) It seems that non-ASCII characters such as å, ä, ö are each counted as two characters.
(The input file is saved in UTF8 format.)
BSD awk (a.k.a. BWK awk), as preinstalled on macOS (still true as of macOS 10.13), is, sadly, NOT Unicode-aware: its length function counts bytes, not characters, and each of å, ä, ö takes two bytes in UTF-8.
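To see the mismatch directly, one can compare the byte count and the character count of one of the problem lines. This is only a rough sketch for a POSIX shell: the variable names are made up for illustration, the octal escapes \303\266 are the UTF-8 bytes of ö, and en_US.UTF-8 is an assumed locale name that may not exist on every system.

```shell
# "fö" written with octal escapes so the bytes are explicit:
# f is 1 byte, ö is 2 bytes (0xC3 0xB6) in UTF-8.
word_bytes=$(printf 'f\303\266' | wc -c | tr -d ' ')

# wc -m counts characters, but only when the active locale is a UTF-8 one.
# en_US.UTF-8 is an assumption; substitute a locale installed on your system.
word_chars=$(printf 'f\303\266' | LC_ALL=en_US.UTF-8 wc -m | tr -d ' ')

echo "bytes: $word_bytes, characters: $word_chars"
```

The byte count is what a byte-oriented length sees, which is why "fö" passes the >= 3 test.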
Your choices are:
If you know that the characters involved fit into a single-byte encoding such as ISO-8859-1, you can use iconv as follows:
iconv -f UTF-8 -t ISO-8859-1 file | awk 'length >= 3' | iconv -f ISO-8859-1 -t UTF-8
Install an awk implementation that is Unicode-aware, such as gawk (GNU Awk), e.g., via Homebrew, and then run gawk in place of awk:
brew install gawk
Note that mawk counts bytes, just like BSD awk, so it does not help here.
Use a different preinstalled tool that is Unicode-aware, such as sed:
sed -n '/^.\{3,\}/p' file
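The iconv route from the first choice can be tried on two of the sample lines without touching any file. This is a sketch under a few assumptions: sample is a hypothetical helper that emits "fö" and "föö" as explicit UTF-8 bytes, and LC_ALL=C is added here (it is not part of the command above) so that awk counts bytes regardless of which implementation is installed.

```shell
# Hypothetical helper: prints "fö" (2 characters, 3 bytes) and
# "föö" (3 characters, 5 bytes) using octal escapes for the UTF-8 bytes.
sample() { printf 'f\303\266\nf\303\266\303\266\n'; }

# Convert to ISO-8859-1 (one byte per character), filter by length,
# then convert the surviving lines back to UTF-8.
filtered=$(sample \
  | iconv -f UTF-8 -t ISO-8859-1 \
  | LC_ALL=C awk 'length >= 3' \
  | iconv -f ISO-8859-1 -t UTF-8)

printf '%s\n' "$filtered"
```

Once every character occupies exactly one byte, a byte-counting length gives the same answer as a character count, so only the three-character line should survive.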
Try setting your locale:
LC_ALL=en_US.UTF-8 awk 'length >= 3' infile
Replace en_US.UTF-8 with a UTF-8 locale that is installed on your system.
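To find out which UTF-8 locales are available, something like the following may help. It is a minimal sketch: is_utf8_locale is a hypothetical helper that only looks at the name, and the exact spelling of locale names (UTF-8 vs. utf8) varies between systems.

```shell
# Hypothetical helper: succeeds if a locale name looks like a UTF-8 locale.
is_utf8_locale() {
  case "$1" in
    *.[Uu][Tt][Ff]-8*|*.[Uu][Tt][Ff]8*) return 0 ;;
    *) return 1 ;;
  esac
}

# List the installed locales and keep only the UTF-8 ones.
locale -a 2>/dev/null | while IFS= read -r name; do
  if is_utf8_locale "$name"; then
    echo "$name"
  fi
done
```

Any name this prints can be used as the LC_ALL value in the command above.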