
(grep) Regex to match non-ASCII characters?

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.

However, is there a regular expression for 'any character that's not an ASCII character'?

Amandasaurus asked Jan 23 '10 17:01




2 Answers

This will match a single non-ASCII character:

[^\x00-\x7F] 

This is a valid PCRE (Perl-Compatible Regular Expression), so with GNU grep it needs the -P option.

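You can also use the POSIX-style bracket class [:ascii:]. It is a Perl/PCRE extension rather than one of the standard POSIX classes, so it needs a Perl-compatible engine as well: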

  • [[:ascii:]] - matches a single ASCII char
  • [^[:ascii:]] - matches a single non-ASCII char

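[^[:print:]] may also suffice for you, but note that it matches ASCII control characters too, and what counts as printable depends on the locale.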
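
As a rough sketch of how the first pattern could answer the original question (counting affected filenames): this assumes GNU find (for -printf) and GNU grep built with PCRE support (for -P). The function name count_non_ascii_names is made up for illustration, and filenames containing newlines would be miscounted.

```shell
#!/bin/sh
# Hypothetical helper: count entries under a directory whose base name
# contains at least one non-ASCII byte.
# Assumes GNU find (-printf) and GNU grep with PCRE support (-P).
count_non_ascii_names() {
    dir=${1:-.}
    # '%f\n' prints only the base name, one per line, so a non-ASCII
    # parent directory does not make every file inside it count.
    # LC_ALL=C makes grep work on raw bytes, so any byte >= 0x80 matches.
    # -c prints the number of matching lines instead of the lines themselves.
    find "$dir" -mindepth 1 -printf '%f\n' | LC_ALL=C grep -cP '[^\x00-\x7F]'
}
```

Calling count_non_ascii_names /path/to/dir prints a single number. Running in the C locale is a deliberate choice here: every multi-byte UTF-8 sequence contains bytes above 0x7F, so byte-wise matching is enough to flag the name.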

Alix Axel answered Oct 27 '22 05:10


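No, [^\x20-\x7E] is not "non-ASCII". It negates only the printable range, space through tilde.

The full ASCII range is \x00-\x7F, so this is the pattern that matches a non-ASCII character:

 [^\x00-\x7F] 

Otherwise, the pattern also matches newlines, tabs, and other control characters that are part of the ASCII table!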
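
The difference can be checked with a small sketch, again assuming GNU grep with -P. The two function names are hypothetical. A tab is used as the control character because grep treats newlines as line separators and never matches them inside a line.

```shell
#!/bin/sh
# Hypothetical helpers; both assume GNU grep built with PCRE support (-P).

# Succeeds if the string contains any byte outside printable ASCII (0x20-0x7E).
matches_outside_printable() {
    printf '%s\n' "$1" | LC_ALL=C grep -qP '[^\x20-\x7E]'
}

# Succeeds only if the string contains a byte outside the full ASCII range.
matches_non_ascii() {
    printf '%s\n' "$1" | LC_ALL=C grep -qP '[^\x00-\x7F]'
}

# A tab (0x09) is ASCII but not printable, so only the first check fires.
sample=$(printf 'a\tb')
if matches_outside_printable "$sample"; then echo "printable-range pattern: match"; fi
if ! matches_non_ascii "$sample"; then echo "full-ASCII pattern: no match"; fi
```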

Peter L answered Oct 27 '22 05:10