
Windows-1252 to UTF-8 encoding

Tags: character-encoding, encoding, utf-8, windows-1252

I've copied some files from a Windows machine to a Linux machine, so all the Windows-encoded (windows-1252) files need to be converted to UTF-8. The files that are already in UTF-8 should not be changed. I'm planning to use the recode utility for that. How can I tell recode to convert only the windows-1252 encoded files and leave the UTF-8 files alone?

Example usage of recode:

recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to confirm that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe the conversion would corrupt the file.

661

asked Jan 06 '10 15:01

Sam

2 Answers

iconv -f WINDOWS-1252 -t UTF-8 filename.txt
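Note that iconv writes the converted text to standard output and does not modify the input file, so the command above only prints the result. A minimal sketch of capturing it in a separate file follows; the helper name `cp1252_to_utf8` and its two-argument form are assumptions made for illustration, not part of iconv.

```shell
# Hypothetical helper: convert a Windows-1252 file ($1) into a new UTF-8 file ($2).
# iconv prints the converted text on stdout, so it is redirected to the target.
# The source file is left untouched.
cp1252_to_utf8() {
    iconv -f WINDOWS-1252 -t UTF-8 "$1" > "$2"
}

# Example call (file names are placeholders):
#   cp1252_to_utf8 filename.txt filename.utf8.txt
```

This does not check whether the input is already UTF-8; running it on a UTF-8 file would double-encode any non-ASCII characters.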

122

answered Sep 23 '22 18:09

Gregory Pakosz

How would you expect recode to know that a file is Windows-1252? In theory, almost any file is a valid Windows-1252 file, as the encoding maps nearly every possible byte to a character (only 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned).

Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.

One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.
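One way to sketch this validity check on Linux is to ask iconv to convert the file from UTF-8 to UTF-8 and look only at its exit status. This uses iconv rather than recode, and the function name `is_valid_utf8` is a hypothetical helper, not an existing command.

```shell
# Hypothetical helper: exit status 0 only if the file ($1) decodes as UTF-8.
# Converting UTF-8 to UTF-8 makes iconv fail at the first invalid byte
# sequence; its output and error messages are discarded.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example call (file name is a placeholder):
#   if is_valid_utf8 myfile.txt; then echo "valid UTF-8"; else echo "not UTF-8"; fi
```

Pure ASCII files pass this check too, which is harmless here because ASCII bytes are identical in Windows-1252 and UTF-8. As the answer says, a pass is only suggestive: a Windows-1252 file can happen to be valid UTF-8.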

I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding - if you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences) it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.
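The round-trip idea can be sketched with iconv instead of recode, since recode's handling of invalid input is not confirmed here. With the `-c` option iconv omits characters it cannot convert, so a file containing invalid UTF-8 no longer matches its own round-tripped output. The function name `roundtrips_as_utf8` is an assumption for illustration.

```shell
# Hypothetical helper: exit status 0 only if the file ($1) is unchanged by a
# UTF-8 to UTF-8 round trip.
# -c makes iconv drop anything it cannot convert, so invalid sequences
# shorten the output; cmp -s then compares that output (read from stdin
# via "-") with the original file without printing anything.
roundtrips_as_utf8() {
    iconv -c -f UTF-8 -t UTF-8 "$1" 2>/dev/null | cmp -s - "$1"
}

# Example call (file name is a placeholder):
#   roundtrips_as_utf8 myfile.txt && echo "input and output are identical"
```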

Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.
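The answer has C# in mind; the same logic can also be scripted in the shell. The sketch below skips any file that already decodes as UTF-8 and converts the rest from Windows-1252. The function name `convert_unless_utf8` and the `.tmp` suffix are assumptions, and the check remains a heuristic: a Windows-1252 file that happens to be valid UTF-8 would be skipped.

```shell
# Hypothetical helper: convert each given file to UTF-8 in place, but only
# if it does not already decode as UTF-8.
convert_unless_utf8() {
    for f in "$@"; do
        if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
            # Already valid UTF-8 (or plain ASCII): leave it alone.
            echo "skipped: $f"
        elif iconv -f WINDOWS-1252 -t UTF-8 "$f" > "$f.tmp"; then
            # Conversion succeeded: replace the original with the result.
            mv "$f.tmp" "$f"
            echo "converted: $f"
        else
            # Conversion failed: discard the partial output, keep the original.
            rm -f "$f.tmp"
            echo "failed: $f" >&2
        fi
    done
}

# Example call (file names are placeholders):
#   convert_unless_utf8 *.txt
```

Because converted files become valid UTF-8, running the function a second time on the same files skips them rather than double-encoding them.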

Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you what it is with 100% accuracy.

answered Sep 23 '22 18:09

Jon Skeet

