I have a large text file that contains a few Unicode characters that make LaTeX crash. How can I find non-ASCII characters in a file with sed or similar tools in a Linux bash shell?
To identify the non-ASCII characters, you can use either Google Chrome or Mozilla Firefox by dragging and dropping the file into the browser. Chrome will show only the row and column number of the offending character.
In Notepad++, if you go to menu Search → Find characters in range → Non-ASCII Characters (128-255), you can then step through the document to each non-ASCII character. Be sure to tick "Wrap around" if you want the search to wrap through the whole document and hit every non-ASCII character.
On BSD, pipe the ls -q output through cat -v or od -c to see what the non-printing characters are. For example, if that shows octal values 13 and 14, looking them up in an ASCII table tells you they correspond to CTRL-k and CTRL-l.
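The same byte-level inspection works on a regular file on Linux. A minimal sketch (thesis.tex is a placeholder name for the file that breaks LaTeX):

cat -v thesis.tex         # non-printing bytes show up as ^K, ^L, M-... sequences
od -c thesis.tex | less   # dumps every byte, using C escapes or octal for non-printing ones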
Try:
nonascii() { LANG=C grep --color=always '[^ -~]\+'; }
Which can be used like:
printf 'ŨTF8\n' | nonascii
Within [], ^ means "not", so [^ -~] matches any character that is not between space and ~ (i.e. not printable ASCII). Apart from also matching control characters, this matches non-ASCII characters, and it is a more portable, though slightly less accurate, version of the [^\x00-\x7f] shown below. The \+ means "one or more", so multibyte characters get the color applied around the complete character(s) rather than interspersed in each byte, which would corrupt the multibyte sequence.
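To run this over a whole file and see where the offending characters are (a sketch; thesis.tex is again a placeholder file name), add -n so grep prefixes each matching line with its line number:

LANG=C grep -n --color=always '[^ -~]\+' thesis.tex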
Try this command:
grep -P '[^\x00-\x7f]' file
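Note that -P needs a grep built with PCRE support (GNU grep usually has it); adding -n also prints line numbers. Since the question mentions sed, here is a rough GNU sed equivalent as a sketch, using the same C-locale printable-ASCII range trick (file is a placeholder); = prints the line number of each line that contains a non-printable or non-ASCII byte, and p prints the line itself:

LC_ALL=C sed -n '/[^ -~]/{=;p;}' file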