Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding non-Ascii character [duplicate]

I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:

grep -e "[\x{00FF}-\x{FFFF}]" file.xml

But this returns every line in the file, regardless of whether the line contains a character in the range specified.

Do I have the syntax wrong or am I doing something else wrong? I've also tried:

egrep "[\x{00FF}-\x{FFFF}]" file.xml 

(with both single and double quotes surrounding the pattern).

like image 285
pconrey Avatar asked Jun 08 '10 20:06

pconrey


People also ask

How do I find a non Unicode character?

To identify the Non Unicode characters we can use either Google Chrome or Mozilla firefox browser by just dragging and dropping the file to the browser. Chrome will show us only the row and column number of the .

How do I find non-ASCII characters in Excel?

There is no ActiveDocument object in Excel, which you can check by opening the VBE (ALT+F11) then opening the Object Browser (F2) and try a search.


2 Answers

You can use the command:

grep --color='auto' -P -n "[\x80-\xFF]" file.xml

This will give you the line number, and will highlight non-ascii chars in red.

In some systems, depending on your settings, the above will not work, so you can grep by the inverse

grep --color='auto' -P -n "[^\x00-\x7F]" file.xml

Note also, that the important bit is the -P flag which equates to --perl-regexp: so it will interpret your pattern as a Perl regular expression. It also says that

this is highly experimental and grep -P may warn of unimplemented features.

like image 196
jerrymouse Avatar answered Oct 02 '22 08:10

jerrymouse


Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters instead.

So the first solution for instance would become:

grep --color='auto' -P -n '[^\x00-\x7F]' file.xml

(which basically greps for any character outside of the hexadecimal ASCII range: from \x00 up to \x7F)

On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:

pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml

Any pros or cons that anyone can think off?

like image 21
pvandenberk Avatar answered Oct 02 '22 09:10

pvandenberk