How do I remove non-ASCII characters from a file?


Remove non-ASCII characters in a file [duplicate]

5 Answers

If you want to use Perl, do it like this:

perl -pi -e 's/[^[:ascii:]]//g' filename

Detailed Explanation

The following explanation covers every part of the above command assuming the reader is unfamiliar with anything in the solution...

perl

Runs the Perl interpreter. Perl is a programming language that is available on most Unix-like systems. This command needs to be run at a shell prompt.
-p

The -p flag tells perl to iterate over every line of the input file, run the specified commands (described later) on each line, and then print the result. It is equivalent to wrapping your perl program in while (<>) { ... } continue { print; }, where ... stands for your program. There's a similar -n flag that does the same but omits the continue { print; } block, so you'd use that if you wanted to do your own printing.
-i

The -i flag tells perl that the input file is to be edited in place and output should go back into that file. This is important to actually modify the file. Omitting this flag will write the output to STDOUT which you can then redirect to a new file.

Note that you cannot omit -i and redirect STDOUT to the input file as this will clobber the input file before it has been read. This is just how the shell works and has nothing to do with perl. The -i flag works around this intelligently.

Perl lets you bundle multiple single-character switches into one argument, which is why we can write -pi instead of -p -i.

The -i flag takes an optional argument, which is a file extension to use if you want to keep a backup of the original file. If you used -i.bak, perl would preserve the original contents as filename.bak before writing the changes. In this example I've omitted creating a backup because I expect you'll be using version control anyway :)
-e

The -e flag tells perl that the next argument is a complete perl program encapsulated in a string. This is not always a good idea if you have a very long program as that can get unreadable, but with a single command program as we have here, its terseness can improve legibility.

Note that we cannot bundle -e directly after -i, because -i treats any text that follows it in the same argument as the backup extension. If we used -ie <program> <filename>, perl would take e as the backup extension and, since no -e switch remains, would try to load <program> as a script file, which fails because no such file exists. The other way around (-ei) does not work either: perl would take i as the program text for -e rather than as the in-place switch, so the substitution would never run.
s/.../.../

This is perl's regex based substitution operator. It takes in four arguments. The first comes before the operator, and if not specified, uses the default of $_. The second and third are between the / symbols. The fourth is after the final / and is g in this case.
- $_ In our code, the first argument is $_, which is the default loop variable in perl. As mentioned above, the -p flag wraps our program in while (<>), which creates a while loop that reads one line at a time (<>) from the input. It implicitly assigns this line to $_, and many built-in functions that take a single argument will use it if none is given (e.g. calling print; actually means print $_;). So, in our code, the s/.../.../ operator operates once on each line of the input file.
- [^[:ascii:]] The second argument is the pattern to search for in the input string. This pattern is a regular expression, so anything enclosed within [] is a bracket expression. This section is probably the most complex part of this example, so we will discuss it in detail at the end.
- <empty string> The third argument is the replacement string, which in our case is the empty string since we want to remove all non-ascii characters.
- g The fourth argument is a modifier flag for the substitution operator. The g flag specifies that the substitution should be global across all matches in the input. Without this flag, only the first match on each line is replaced. Other possible flags are i for case-insensitive matches, s and m, which change how . and the ^ and $ anchors treat newlines and so matter mainly for multi-line strings (we have single-line strings here), o, which compiles the pattern only once and matters only when the pattern interpolates variables (ours does not), and x, which lets the pattern include whitespace and comments to make it more readable (but then we should not write our program on a single line).
filename

This is the input file that contains non-ascii characters that we'd like to strip out.

[^[:ascii:]]

So now let's discuss [^[:ascii:]] in more detail.

As mentioned above, [] in a regular expression specifies a bracket expression, which tells the regex engine to match a single character in the input that matches any one of the characters in the set of characters inside the expression. So, for example, [abc] will match either an a, or a b or a c, and it will match only a single character. Using ^ as the first character inverts the match, so [^abc] will match any one character that is not an a, b, or c.

But what about [:ascii:] inside the bracket expression?

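If you have a Unix-based system available, run man 7 re_format at the command line to read the man page (on Linux the equivalent page is man 7 regex). Perl's own description of these classes is available via perldoc perlrecharclass. If not, read the online version.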

[:ascii:] is a character class that represents the entire set of ascii characters, but this kind of a character class may only be used inside a bracket expression. The correct way to use this is [[:ascii:]] and it may be negated as with the abc case above or combined within a bracket expression with other characters, so, for example, [éç[:ascii:]] will match all ascii characters and also é and ç which are not ascii, and [^éç[:ascii:]] will match all characters that are not ascii and also not é or ç.

answered Oct 07 '22 17:10

bluesmoon

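LC_ALL=C tr -dc '[:print:][:cntrl:]' < input-file > cleaned-file

That's assuming you want to retain "control" characters and "printable" characters (spaces included). The quotes stop the shell from treating the brackets as a glob pattern, and LC_ALL=C makes tr classify single bytes. Fiddle as required.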
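
A quick check of what these tr character classes keep, using an invented sample line. It assumes a tr that classifies single bytes under LC_ALL=C, which is the usual behaviour.

```shell
# Input contains ï and é (UTF-8), a space, a tab and a newline.
# Under LC_ALL=C every byte above 127 is neither printable nor a control character.
printf 'na\303\257ve caf\303\251\tok\n' | LC_ALL=C tr -dc '[:print:][:cntrl:]'

# [:graph:] is [:print:] without the space character, so this variant also removes spaces.
printf 'na\303\257ve caf\303\251\tok\n' | LC_ALL=C tr -dc '[:graph:][:cntrl:]'
```

Both commands keep the tab and the newline because those belong to [:cntrl:]; only the first one keeps the space between the two words.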

answered Oct 07 '22 17:10

Carl Smotricz

perl -pe's/[[:^ascii:]]//g' < input.txt > output.txt

answered Oct 07 '22 16:10

Thomas

You can write a C program like this:

#include <stdio.h>

int main(void)
{
   /* File names are hard-coded for brevity; binary mode leaves bytes untouched. */
   FILE *fin = fopen("source_file", "rb");
   FILE *fout = fopen("target_file", "wb");
   int c;
   while ((c = fgetc(fin)) != EOF) {
       /* ASCII covers 0..127, so newlines and tabs are kept as well. */
       if (c < 128)
          fputc(c, fout);
   }
   fclose(fin);
   fclose(fout);
   return 0;
}

Note: error checks were avoided for simplicity.

Compile it with:

$ gcc -W source_code.c -o convert

Run it with:

$ ./convert

answered Oct 07 '22 15:10

Pablo Santa Cruz

My two cents: It might not solve your problem, but it may give you some hints.

The file command reports a file's encoding (e.g. UTF-8 or ASCII), and iconv can convert a file between different encodings.
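
As a rough sketch of how the two tools fit together, the following creates an invented UTF-8 sample file, inspects it, and converts it to ASCII. The -c option asks iconv to drop characters that cannot be represented in the target encoding; the file names are hypothetical.

```shell
cd "$(mktemp -d)" || exit 1
printf 'caf\303\251\n' > sample.txt   # "café" in UTF-8

# Report the detected encoding (the exact wording differs between systems).
command -v file >/dev/null && file sample.txt

# Convert UTF-8 to ASCII, dropping anything that has no ASCII equivalent.
# Some iconv builds exit non-zero when they drop something, hence "|| true".
iconv -c -f UTF-8 -t ASCII sample.txt > cleaned.txt || true
cat cleaned.txt
```

Unlike the byte-oriented approaches, iconv needs to be told the source encoding, so check it with file first if you are unsure.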

answered Oct 07 '22 17:10

Nikhil S

Tags:

unix

janar
