I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:
grep -e "[\x{00FF}-\x{FFFF}]" file.xml
But this returns every line in the file, regardless of whether the line contains a character in the range specified.
Do I have the syntax wrong or am I doing something else wrong? I've also tried:
egrep "[\x{00FF}-\x{FFFF}]" file.xml
(with both single and double quotes surrounding the pattern).
To identify the Non Unicode characters we can use either Google Chrome or Mozilla firefox browser by just dragging and dropping the file to the browser. Chrome will show us only the row and column number of the .
Alternatively, you can also use regular expressions to find non-ASCII characters. ASCII character set is captured using regex [A-Za-z0-9]. You can use this regex in your query as shown below, to find non-ASCII characters. mysql> SELECT * FROM data WHERE full_name NOT REGEXP '[A-Za-z0-9]';
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number, and will highlight non-ascii chars in red.
In some systems, depending on your settings, the above will not work, so you can grep by the inverse
grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
Note also, that the important bit is the -P
flag which equates to --perl-regexp
: so it will interpret your pattern as a Perl regular expression. It also says that
this is highly experimental and grep -P may warn of unimplemented features.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With