Convert ANSI (Windows 1252) to UTF8 in C#

I've asked this before in a roundabout manner here on Stack Overflow, and want to get it right this time. How do I convert ANSI (code page 1252) to UTF-8 while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all UTF-8 characters that are supported by ANSI and substitute the rest with a ? or something.)

Why I Want To Convert ANSI → UTF-8

I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows in what formats the other phones save them!

So my questions are

  1. Isn't there an industry standard for vCard files' character encoding?
  2. Which is easier for solving my problem: converting ANSI to UTF-8 (and/or the other way round), or trying to detect which encoding the input file has and notifying the user about it?

tl;dr Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.

GPX asked Dec 08 '10 11:12


1 Answer

You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.

Once you have read the file using the correct encoding, you have the content as a Unicode string. From there you can save it using any encoding you like.
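As a rough illustration of that read-then-write approach, here is a minimal sketch. The class and method names (`VCardTranscoder`, `Cp1252FileToUtf8`, `ToCp1252`) are hypothetical and not part of the original answer, and the sketch assumes you already know the input file is Windows-1252. On .NET Core / .NET 5+ code page 1252 is only available after registering `CodePagesEncodingProvider`; on older .NET Framework versions that call may be unnecessary or unavailable.

```csharp
using System.IO;
using System.Text;

public static class VCardTranscoder
{
    // Hypothetical helper: decode a Windows-1252 file into a string,
    // then write that string back out as UTF-8 without a BOM.
    public static void Cp1252FileToUtf8(string inputPath, string outputPath)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding(1252) works.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding cp1252 = Encoding.GetEncoding(1252);
        string text = File.ReadAllText(inputPath, cp1252);

        // UTF8Encoding(false) means "do not emit a byte order mark".
        File.WriteAllText(outputPath, text, new UTF8Encoding(false));
    }

    // Hypothetical helper for the opposite direction: characters that
    // Windows-1252 cannot represent are replaced with '?'.
    public static byte[] ToCp1252(string text)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding lossy = Encoding.GetEncoding(
            1252,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        return lossy.GetBytes(text);
    }
}
```

Going from Windows-1252 to UTF-8 loses nothing, since every Windows-1252 character exists in Unicode. Only the opposite direction needs the `?` substitution the question mentions, which is why the fallback is set explicitly in `ToCp1252`.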

If you need to detect the encoding, you can read the file as bytes and then look for byte values that are specific to one of the two encodings. If the file contains no special characters, either encoding will work, because the ASCII range (bytes 0..127) is encoded identically in both.
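One common way to do this kind of byte-level check is to attempt a strict UTF-8 decode and fall back to Windows-1252 when the bytes are not valid UTF-8. The sketch below takes that route. It is a swapped-in heuristic rather than the exact method the answer describes, and the names `VCardDecoder` and `DecodeUtf8OrCp1252` are hypothetical. It also assumes the input is one of those two encodings. A Windows-1252 file whose special characters happen to form valid UTF-8 sequences would be misread, although that is unlikely in real text.

```csharp
using System.Text;

public static class VCardDecoder
{
    // Hypothetical helper: returns the decoded text, trying strict UTF-8 first.
    public static string DecodeUtf8OrCp1252(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD for malformed sequences.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            // Pure ASCII input also succeeds here, which is harmless because
            // ASCII bytes decode to the same characters in both encodings.
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Assumption: running on .NET Core / .NET 5+, where code page 1252
            // must be registered before it can be requested.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }
}
```

A call such as `VCardDecoder.DecodeUtf8OrCp1252(File.ReadAllBytes(path))` would then give a Unicode string regardless of which of the two encodings the phone used. Note that `GetString` does not strip a UTF-8 byte order mark, so a file that starts with one yields a leading U+FEFF character.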

Guffa answered Sep 20 '22 03:09