What characters do not directly map from Cp1252 to UTF-8?

Tags: character-encoding, utf-8, utf, codepages, cp1252

I've read in several Stack Overflow answers that some characters do not directly map (or are even "unmappable") when converting from Cp1252 (aka Windows-1252; they're the same, aren't they?) to UTF-8, e.g. here: https://stackoverflow.com/a/23399926/2018047

Can someone please shed some more light on this? Does that mean that if I batch/mass convert source code from cp1252 to utf-8 I'll get some characters that will end up as garbage?

asked Oct 12 '14 11:10

Christian

2 Answers

Look at the Windows-1252 code page table: bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D do not have anything assigned to them.

If your input file contains any of those bytes and you decode it as Windows-1252, they will be treated as invalid characters. Under normal circumstances, this means that the input file was not actually in Windows-1252.

All other bytes encode either printable characters or control characters, and all those characters are present in Unicode and therefore can unambiguously be encoded in UTF-8.
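A quick way to check this is to ask a decoder directly. The sketch below uses Python's built-in `cp1252` codec, which treats those five bytes as undefined; other implementations may be more lenient (the WHATWG `windows-1252` decoder used by browsers, for instance, maps them to C1 control characters). The function name `undefined_cp1252_bytes` is made up for this example.

```python
def undefined_cp1252_bytes():
    """Return every byte value that Python's cp1252 codec refuses to decode."""
    undefined = []
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            undefined.append(value)
    return undefined


if __name__ == "__main__":
    print([hex(b) for b in undefined_cp1252_bytes()])
```

Every byte not in that list decodes to a character that can then be encoded as UTF-8 without loss.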

I have no idea what the linked answer is trying to claim; its last paragraph sounds like nonsense.

A few more remarks that may shed some light on what you are trying to find out:

- UTF-8 and Windows-1252 are totally incompatible with each other outside ASCII.
- Each encoding has byte values it never produces when encoding text, and those values differ between the two.
- Moreover, certain byte sequences are invalid in UTF-8.
- In general, if you treat a file as if it contained text encoded in UTF-8 or Windows-1252 when it doesn't, you will lose and corrupt data.
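To make the incompatibility concrete, here is a small Python sketch. The sample word "café" is just an illustration; any text with a non-ASCII character would behave similarly.

```python
text = "café"

cp1252_bytes = text.encode("cp1252")  # the é becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")     # the é becomes the two bytes 0xC3 0xA9

# UTF-8 bytes misread as Windows-1252: no error is raised,
# but each of the two bytes turns into its own character.
mojibake = utf8_bytes.decode("cp1252")

# Windows-1252 bytes misread as UTF-8: 0xE9 announces a multi-byte
# sequence that never follows, so a strict decoder rejects the input.
try:
    cp1252_bytes.decode("utf-8")
    misread_succeeded = True
except UnicodeDecodeError:
    misread_succeeded = False

if __name__ == "__main__":
    print(mojibake, misread_succeeded)
```

The first misreading is the more dangerous one in practice, because it silently produces wrong text instead of failing.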

You can select the encoding of your files in your IDE or editor. It's recommended to go UTF-8 only. You will have to convert existing Windows 1252 files.
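If you would rather script the conversion than do it through an editor, something like the following Python sketch could work. It assumes every file passed in really is Windows-1252; the helper name `convert_cp1252_to_utf8` is hypothetical.

```python
from pathlib import Path


def convert_cp1252_to_utf8(path):
    """Rewrite one file in place as UTF-8, assuming it is currently Windows-1252."""
    path = Path(path)
    raw = path.read_bytes()
    # Strict decoding: raises UnicodeDecodeError if the file contains one of
    # the unassigned bytes, which suggests it was not Windows-1252 after all.
    text = raw.decode("cp1252")
    # Working on bytes avoids any newline translation by the platform.
    path.write_bytes(text.encode("utf-8"))
```

For a batch run you might loop over something like `Path("src").rglob("*.java")` (the directory and pattern are placeholders). Run it only once per file: converting a file that is already UTF-8 would decode it as Windows-1252 and corrupt any non-ASCII characters. Working on a copy or a clean version-control checkout is a sensible precaution.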

answered Sep 29 '22 07:09

Karol S

> Can someone please shed some more light on this?

The cp1252 decoding function is mostly an identity function.

cp1252    UCP       (UCP = Unicode Code Point)
--------  --------
21        21 (!)    (All numbers in hex)
31        31 (1)
41        41 (A)

This makes it seem like something expecting UCP (not UTF-8) will also accept cp1252. The author of the linked Answer is pointing out that this is not the case.

cp1252    UCP
--------  --------
80        20AC (€)
85        2026 (…)
99        2122 (™)

The exceptions are all found between 80 and 9F, inclusive.

Something that accepts UCP will also accept iso-8859-1, but not cp1252.
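The tables above can be reproduced with a few lines of Python. This is only a sketch built on Python's own `cp1252` and `latin-1` (ISO-8859-1) codecs; the function name `cp1252_exceptions` is invented for the example.

```python
def cp1252_exceptions():
    """Map each byte whose cp1252 code point differs from the byte's own value."""
    exceptions = {}
    for value in range(256):
        try:
            char = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # byte is unassigned in Python's cp1252 codec
        if ord(char) != value:
            exceptions[value] = ord(char)
    return exceptions


def latin1_is_identity():
    """Check that ISO-8859-1 maps every byte to the code point of the same value."""
    return all(ord(bytes([v]).decode("latin-1")) == v for v in range(256))


if __name__ == "__main__":
    for byte, code_point in cp1252_exceptions().items():
        print(f"{byte:02X}  {code_point:04X} ({chr(code_point)})")
```

With Python's codec, the exceptions are the 27 assigned bytes in the range 80 to 9F; everything else decodes to the code point with the same number.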

> Does that mean that if I batch/mass convert source code from cp1252 to utf-8 I'll get some characters that will end up as garbage?

No. Every character in cp1252 maps to a Unicode code point, so it can be converted to UTF-8 successfully using a proper tool.

answered Sep 29 '22 08:09

ikegami
