
find and delete files with non-ascii names


I have some old migrated files whose names contain non-printable characters. I would like to find all files with such names and delete them completely from the system.

Example:

ls -l
-rwxrwxr-x 1 cws cws      0 Dec 28  2011 ??"??

ls -lb
-rwxrwxr-x 1 cws cws      0 Dec 28  2011 \a\211"\206\351

I would like to find all such files.

(The original post included a screenshot of the ls output in one of these folders.)

I want to find these files with the non-printable characters and just delete them.

Rohit Chopra asked Oct 02 '13 20:10


2 Answers

Non-ASCII characters

ASCII character codes range from 0x00 to 0x7F in hex. Therefore, any character with a code greater than 0x7F is a non-ASCII character. This includes the bulk of the characters in UTF-8 (ASCII is a subset of UTF-8). For example, the Japanese character

あ

is encoded in hex in UTF-8 as

E3 81 82
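
As a quick check of that byte sequence, here is a minimal sketch that writes the three bytes with printf octal escapes and dumps them with od. The variable name hex is only for illustration.

```shell
# Octal 343 201 202 are the bytes 0xE3 0x81 0x82, the UTF-8 encoding of U+3042.
hex=$(printf '\343\201\202' | od -An -tx1 | tr -d ' \n')
echo "$hex"   # → e38182
```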

UTF-8 has been the default character encoding on, among others, Red Hat Linux since version 8.0 (2002), SuSE Linux since version 9.1 (2004), and Ubuntu Linux since version 5.04 (2005).

ASCII control characters

Out of the ASCII codes, 0x00 through 0x1F and 0x7F represent control characters such as ESC (0x1B). These control characters were not originally intended to be printable even though some of them, like the line feed character 0x0A, can be interpreted and displayed.

On my system, ls displays all control characters as ? by default, unless I pass the --show-control-chars option. I'm guessing that the files you want to delete contain ASCII control characters, as opposed to non-ASCII characters. This is an important distinction: if you delete filenames containing non-ASCII characters, you may blow away legitimate files that just happen to be named in another language.
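
To see the two views of the same name side by side, the following sketch creates one file with a BEL character (0x07) in its name inside a throwaway directory. It assumes an ls that supports -q (mask non-printable characters as ?) and -b (print C-style escapes), as GNU ls does; the sandbox, masked and escaped names are made up for this example.

```shell
# Work in a temporary directory so no real files are touched.
sandbox=$(mktemp -d)
touch "$sandbox/$(printf 'a\007b')"

# -q replaces non-printable characters with ?, -b prints them as escapes.
masked=$(LC_ALL=C ls -q "$sandbox")
escaped=$(LC_ALL=C ls -b "$sandbox")
printf '%s\n%s\n' "$masked" "$escaped"

rm -rf "$sandbox"
```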

Regular expressions for character codes

POSIX

POSIX provides a very handy collection of character classes for dealing with these types of characters (thanks to bashophil for pointing this out):

[:cntrl:] Control characters
[:graph:] Graphic printable characters (same as [:print:] minus the space character)
[:print:] Printable characters (same as [:graph:] plus the space character)
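
These classes also work in shell patterns, which makes it easy to test a single name. A minimal sketch, where has_cntrl is a hypothetical helper name and not part of any tool:

```shell
# Succeeds (exit status 0) if the argument contains at least one control character.
has_cntrl() {
    case $1 in
        *[[:cntrl:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_cntrl "$(printf 'tab\there')"; then
    echo "control character found"
fi
```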

PCRE

Perl Compatible Regular Expressions allow hexadecimal character codes using the syntax

\x00

For example, a PCRE regex for the Japanese character would be

\xE3\x81\x82

In addition to the POSIX character classes listed above, PCRE also provides the [:ascii:] character class, which is a convenient shorthand for [\x00-\x7F].

GNU's version of grep supports PCRE using the -P flag, but BSD grep (on Mac OS X, for example) does not. Neither GNU nor BSD find supports PCRE regexes.
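
As a rough illustration of the hex syntax, this sketch tests a string for non-ASCII bytes. It assumes a GNU grep built with PCRE support (the -P flag); has_non_ascii is a hypothetical helper name.

```shell
# Succeeds if the argument contains any byte outside 0x00-0x7F.
# LC_ALL=C makes grep treat the input as raw bytes rather than decoded characters.
has_non_ascii() {
    printf '%s' "$1" | LC_ALL=C grep -q -P '[^\x00-\x7F]'
}

# 'caf' followed by the two UTF-8 bytes of an accented e.
if has_non_ascii "$(printf 'caf\303\251')"; then
    echo "non-ASCII byte found"
fi
```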

Finding the files

GNU find supports POSIX regexes (thanks to iscfrc for pointing out the pure find solution to avoid spawning additional processes). The following command will list all filenames (but not directory names) below the current directory that contain non-printable control characters:

find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$'

The regex is a little complicated because the -regex option has to match the entire file path, not just the filename, and because I'm assuming that we don't want to blow away files with normal names simply because they are inside directories with names containing control characters.

To delete the matching files, simply pass the -delete option to find, after all other options (this is critical; passing -delete as the first option will blow away everything in your current directory):

find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -delete

I highly recommend running the command without the -delete first, so you can see what will be deleted before it's too late.
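
One way to rehearse the whole sequence safely is in a scratch directory. The sketch below assumes GNU find (for -regextype and -delete); the sandbox directory and both filenames are made up for the demonstration.

```shell
# Build a throwaway sandbox: one normal file, one with a BEL (0x07) in its name.
sandbox=$(mktemp -d)
touch "$sandbox/normal.txt" "$sandbox/$(printf 'bad\007name')"

pattern='^.*/[^/]*[[:cntrl:]][^/]*$'

# Step 1: dry run. Nothing is removed; we only count what would match.
would_delete=$(cd "$sandbox" &&
    LC_ALL=C find . -type f -regextype posix-basic -regex "$pattern" |
    wc -l | tr -d ' ')
echo "would delete: $would_delete"

# Step 2: the same command with -delete as the last option.
(cd "$sandbox" &&
    LC_ALL=C find . -type f -regextype posix-basic -regex "$pattern" -delete)

remaining=$(ls -A "$sandbox")
echo "remaining: $remaining"

rm -rf "$sandbox"
```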

If you also pass the -print option, you can see what is being deleted as the command runs:

find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -print -delete

To blow away any paths (files or directories) that contain control characters, the regex can be simplified and you can drop the -type option:

find -regextype posix-basic -regex '.*[[:cntrl:]].*' -print -delete

With this command, if a directory name contains control characters, the directory and everything inside it will be deleted, even if none of the filenames inside it contain control characters themselves.
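
The difference between the two regexes is easiest to see with a cleanly named file inside a badly named directory. This is a sketch assuming GNU find; the directory and file names are invented, and nothing is deleted because only -print is used.

```shell
# A directory whose name contains ESC (0x1B), holding one cleanly named file.
sandbox=$(mktemp -d)
baddir="$sandbox/$(printf 'dir\033x')"
mkdir "$baddir"
touch "$baddir/clean.txt"

# Strict regex: only the last path component is checked, so nothing matches.
strict=$(cd "$sandbox" &&
    LC_ALL=C find . -type f -regextype posix-basic \
        -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -print | wc -l | tr -d ' ')

# Loose regex: the whole path is checked, so the directory and its file both match.
loose=$(cd "$sandbox" &&
    LC_ALL=C find . -regextype posix-basic \
        -regex '.*[[:cntrl:]].*' -print | wc -l | tr -d ' ')

echo "strict: $strict, loose: $loose"
rm -rf "$sandbox"
```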


Update: Finding both non-ASCII and control characters

It looks like your files contain both non-ASCII characters and ASCII control characters. As it turns out, [:ascii:] is not a POSIX character class, but it is provided by PCRE. I couldn't find a POSIX regex to do this, so it's Perl to the rescue. We'll still use find to traverse our directory tree, but we'll pass the results to Perl for processing.

To make sure we can handle filenames containing newlines (which seems likely in this case), we need to use the -print0 option to find (supported on both GNU and BSD versions); this separates records with a null character (0x00) instead of a newline. The null character is the only safe separator because it is the only character that can never appear in a path on Linux (a single filename also cannot contain /, but a path can). We need to pass the corresponding flag -0 to our Perl code so it knows how records are separated. The following command will print every path inside the current directory, recursively:

find . -print0 | perl -n0e 'print $_, "\n"'

Note that this command only spawns a single instance of the Perl interpreter, which is good for performance. The starting path argument (in this case, . for CWD) is optional in GNU find but is required in BSD find on Mac OS X, so I've included it for the sake of portability.
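
To illustrate why newline-separated output is unreliable, this sketch counts the same two paths both ways. The filename with an embedded newline is made up for the demonstration.

```shell
# One file whose name contains a newline, inside a scratch directory.
sandbox=$(mktemp -d)
touch "$sandbox/$(printf 'two\nlines')"

# Counting newline-terminated records overcounts: "." plus a name split in two.
by_newline=$(cd "$sandbox" && find . -print | wc -l | tr -d ' ')

# Counting NUL terminators gives the true number of paths: "." and the file.
by_nul=$(cd "$sandbox" && find . -print0 | tr -cd '\000' | wc -c | tr -d ' ')

echo "newline records: $by_newline, NUL records: $by_nul"
rm -rf "$sandbox"
```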

Now for our regex. Here is a PCRE regex matching names that contain either non-ASCII or non-printable (i.e. control) characters (or both):

[[:^ascii:][:cntrl:]]

The following command will print all paths (directories or files) in the current directory that match this regex:

find . -print0 | perl -n0e 'chomp; print $_, "\n" if /[[:^ascii:][:cntrl:]]/'

The chomp is necessary because it strips off the trailing null character from each path, which would otherwise match our regex. To delete the matching files and directories, we can use the following:

find . -print0 | perl -MFile::Path=remove_tree -n0e 'chomp; remove_tree($_, {verbose=>1}) if /[[:^ascii:][:cntrl:]]/'

This will also print out what is being deleted as the command runs (although control characters are interpreted so the output will not quite match the output of ls).

ThisSuitIsBlackNot answered Sep 21 '22 17:09


Based on this answer, try:

LC_ALL=C find . -regex '.*[^ -~].*' -print # -delete

or:

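LC_ALL=C find . -type f -regex '.*[^[:alnum:][:punct:]].*' -print # -delete

Note: -regex has to match the whole path, hence the leading and trailing .* in both commands. The second command also matches names that contain spaces, since a space is neither alphanumeric nor punctuation. Once the printed list looks right, remove the # character to enable -delete.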
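
A quick way to check what the first pattern catches is to run it against a few invented names in a scratch directory. This sketch assumes a find that supports -regex (GNU or BSD) and only prints, so nothing is deleted.

```shell
# A name with a space (printable), one with UTF-8 bytes, one with a tab.
sandbox=$(mktemp -d)
touch "$sandbox/ok name.txt" \
      "$sandbox/$(printf 'caf\303\251')" \
      "$sandbox/$(printf 'tab\tname')"

# In the C locale, [^ -~] matches any byte outside the printable ASCII range 0x20-0x7E.
hits=$(cd "$sandbox" &&
    LC_ALL=C find . -regex '.*[^ -~].*' -print | wc -l | tr -d ' ')

echo "paths matched: $hits"
rm -rf "$sandbox"
```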

See also: How do I grep for all non-ASCII characters.

kenorb answered Sep 20 '22 17:09