 

Setting shell script to utf8

I want to put the following command line into a shell script:

cat text.tsv |
grep -Pvi '.\t.\t.*\bHotels|Гостиница|Готель|Отель|Хотел|ホテル|מלון|فندق|होटल|โรงแรม|숙박|호텔|宾馆|旅店|旅馆|酒店|飯店\b' |
awk '{print $0,"\t","column1"} > Text2.tsv

However, when I put this into a .sh file, all non-ASCII characters get replaced with question marks:

cat text.tsv |
grep -Pvi '.\t.\t.*\bHotels|?????????|??????|?????|?????|???|????|????|????|??????|??|??|??|??|??|??|??\b' |
awk '{print $0,"\t","True"} > Text2.tsv

How do I convert my .sh file to UTF-8? I tried:

iconv -c -f ASCII -t UTF-8 Test.sh > Test2.sh 

But this doesn't seem to work.

Vic23 asked Feb 12 '15 12:02


2 Answers

Bash honors your locale settings.

Check the current settings with the locale command.

If the locale is not a UTF-8 one, set one like this:

export LANG=C.UTF-8
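
A minimal sketch of that check, assuming a system that provides the standard locale utility. The function name is_utf8_locale is made up for illustration, and C.UTF-8 is not installed everywhere, so the fallback branch lists what is available first:

```shell
# Succeeds only when the active locale uses UTF-8 as its character encoding.
# (is_utf8_locale is an illustrative name, not a standard command.)
is_utf8_locale() {
    [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]
}

if is_utf8_locale; then
    echo "Locale is already UTF-8"
else
    # List the UTF-8 locales installed on this machine.
    locale -a 2>/dev/null | grep -i 'utf-*8' || echo "no UTF-8 locale found" >&2
    # Assumption: C.UTF-8 exists here; otherwise use a name from the list above.
    export LANG=C.UTF-8
fi
```

Note that LC_ALL and LC_CTYPE take precedence over LANG, so if either is set to a non-UTF-8 value, changing LANG alone has no effect.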
jeremf answered Oct 31 '22 06:10


The script itself should be in UTF-8. You need to make sure your locale and your Bash settings are set up properly (really old versions of Bash would need to be explicitly configured to pass through 8-bit data, etc.; but this should be a matter of ancient history on any reasonably modern platform). Basically, this should Just Work.

There are many things which could be wrong, though. Is the script file properly in UTF-8? The file Test2.sh almost certainly isn't, and you should have received warnings from iconv if the input in Test.sh was correctly formatted, so we vaguely speculate that you have used some other encoding in this file, which would explain why things don't work.
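
One way to test whether the script file really is UTF-8 is to ask iconv to convert it from UTF-8 to UTF-8 and look at the exit status. This is only a sketch: is_valid_utf8 is a made-up helper name, and the source encoding in the commented conversion is a placeholder, since the question does not say what the editor actually wrote.

```shell
# Succeeds only if the named file decodes cleanly as UTF-8.
# (is_valid_utf8 is an illustrative helper name.)
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Usage sketch with the file names from the question:
#   is_valid_utf8 Test.sh || echo "Test.sh is not valid UTF-8"
#
# If the file turns out to be in some other known encoding, convert from
# that encoding (UTF-16 is only a placeholder guess here):
#   iconv -f UTF-16 -t UTF-8 Test.sh > Test2.sh
```

A pure ASCII file also passes this check, because ASCII is a subset of UTF-8. If the editor already saved literal question marks, the file is valid but the original characters are gone, and no conversion will bring them back; the text has to be pasted again in an editor set to save as UTF-8.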

Also, your Awk script seems to be missing a closing single quote at the end.

Finally, anything which looks like grep | awk can usually be refactored more elegantly into just an Awk script. Get rid of the Useless cat while you're at it.

awk 'tolower($0) !~ /.\t.\t.*\<(hotels|гостиница|готель|отель|хотел|ホテル|מלון|فندق|होटल|โรงแรม|숙박|호텔|宾馆|旅店|旅馆|酒店|飯店)\>/ {
    # The line is lowercased before matching, so the alternatives must be lowercase too
    print $0,"\t","column1"
}' text.tsv > Text2.tsv

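I assume your regex was missing a pair of parentheses around the hotel phrases; without them, the leading part of the pattern applies only to the first alternative and the trailing \b only to the last. Awk doesn't recognize \b, but in GNU Awk \< and \> match the start and end of a word, which serves the same purpose.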

If the intent is to look for these phrases in the third column of a tab-separated text file, use -F '\t' and examine $3 directly.
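
A sketch of that variant, assuming the hotel words are indeed expected in the third column. The function name filter_non_hotels is made up for illustration, and the word-boundary operators are left out so that the pattern does not depend on GNU Awk:

```shell
# filter_non_hotels is an illustrative name. It reads tab-separated lines from
# the named files (or standard input) and prints only those whose third column
# contains none of the hotel words, with "column1" appended as a new field.
filter_non_hotels() {
    awk -F '\t' -v OFS='\t' '
        # The field is lowercased before matching, so the alternatives are lowercase.
        tolower($3) !~ /hotels|гостиница|готель|отель|хотел|ホテル|מלון|فندق|होटल|โรงแรม|숙박|호텔|宾馆|旅店|旅馆|酒店|飯店/ {
            print $0, "column1"
        }
    ' "$@"
}

# Usage sketch with the file names from the question:
#   filter_non_hotels text.tsv > Text2.tsv
```

Setting OFS to a tab makes print join the original line and the new column with a single tab, without the spaces that print $0,"\t","column1" puts around it. Whether tolower() also folds non-ASCII letters such as the Cyrillic ones depends on the Awk implementation and the locale; GNU Awk does so in a UTF-8 locale.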

tripleee answered Oct 31 '22 04:10