Remove Unicode characters from text files - sed, other Bash/shell methods

How do I remove Unicode characters from a bunch of text files in the terminal?

I've tried this, but it didn't work:

sed 'g/\u'U+200E'//' -i *.txt

I need to remove these Unicode characters from the text files:

U+0091 - sort of weird "control" space
U+0092 - same sort of weird "control" space
U+00A0 - non-breaking space
U+200E - left to right mark
alvas asked Dec 19 '11 13:12


3 Answers

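To strip every non-ASCII character from file.txt, use either of the commands below. Both print the result to standard output and leave file.txt itself unchanged. Note that strings only prints runs of at least 4 printable characters by default, so shorter fragments are lost: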

$ iconv -c -f utf-8 -t ascii file.txt
$ strings file.txt

Options:

-c # discard unconvertible characters
-f # from ENCODING
-t # to ENCODING
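
Since the question is about a bunch of files, here is a minimal sketch of the iconv approach applied to several files at once. It assumes the inputs are UTF-8; `ascii_copy` and the `.ascii` suffix are made-up names for illustration, and iconv implementations differ in how they report dropped characters, so the exit status is deliberately ignored.

```shell
# ascii_copy is a hypothetical helper name, not an iconv feature.
# For each file given, write an ASCII-only copy next to it as FILE.ascii;
# the original file is left untouched.
ascii_copy() {
    for f in "$@"; do
        # Exit status is ignored on purpose: some iconv builds report
        # failure whenever -c had to drop a character.
        iconv -c -f utf-8 -t ascii "$f" > "$f.ascii" || true
    done
}
```

Calling `ascii_copy *.txt` would then leave a `.ascii` copy beside each text file, which you can inspect before replacing the originals.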
kev answered Oct 17 '22 15:10


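If you want to remove only particular characters and you have Python, you can build the character set with Python and use it in a sed bracket expression. The version below assumes Python 3 and that sed runs in a UTF-8 locale, so the bracket expression matches whole characters rather than single bytes:

CHARS=$(PYTHONIOENCODING=utf-8 python3 -c 'print("\u0091\u0092\u00a0\u200e")')
sed 's/['"$CHARS"']//g' < /tmp/utf8_input.txt > /tmp/ascii_output.txt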
Michał Šrajer answered Oct 17 '22 15:10


If the files are UTF-8 encoded, you can match the characters' byte sequences directly in sed (the \x escapes and the \| alternation are GNU sed extensions):

sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g'
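
A hedged, more portable sketch of the same byte-level idea: it builds the four UTF-8 byte sequences with octal `printf` escapes and removes each one with its own `-e` expression, so it does not rely on `\x` or `\|`. The function name `strip_marks` is made up for illustration, and the input is assumed to be UTF-8.

```shell
# strip_marks is a hypothetical name. It reads stdin and writes stdout,
# deleting the UTF-8 encodings of U+0091, U+0092, U+00A0 and U+200E.
strip_marks() {
    c1=$(printf '\302\221')        # U+0091 = bytes C2 91
    c2=$(printf '\302\222')        # U+0092 = bytes C2 92
    nbsp=$(printf '\302\240')      # U+00A0 = bytes C2 A0
    lrm=$(printf '\342\200\216')   # U+200E = bytes E2 80 8E
    # LC_ALL=C makes sed treat the input as plain bytes.
    LC_ALL=C sed -e "s/$c1//g" -e "s/$c2//g" -e "s/$nbsp//g" -e "s/$lrm//g"
}
```

To process many files, something like `for f in *.txt; do strip_marks < "$f" > "$f.clean"; done` writes a cleaned copy of each file and leaves the originals in place. Other non-ASCII characters are kept.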
choroba answered Oct 17 '22 15:10