
Cyrillic encoding in C#

I have a bunch of Cyrillic-like text in a MSSQL database and need to convert it to Cyrillic in C#.

So... Ðàáîòà â ãåðìàíèè

should become

Работа в германии

Any suggestions?

I should add that the closest I've gotten is ?aaioa a aa?iaiee

Here's the code I'm using:

 str = Encoding.UTF8.GetString(Encoding.GetEncoding("Windows-1251").GetBytes(drCurrent["myfield"].ToString()));
 str = Encoding.GetEncoding(1251).GetString(Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1251), Encoding.UTF8.GetBytes(str)));
user1541301 asked Aug 17 '12 21:08


2 Answers

// To find out the source and target encodings, brute-force every pair:
const string source = "Ðàáîòà â ãåðìàíèè";
const string destination = "Работа в германии";

foreach (var sourceEncoding in Encoding.GetEncodings())
{
    // Re-encode the garbled text to recover the candidate raw bytes
    var bytes = sourceEncoding.GetEncoding().GetBytes(source);
    foreach (var targetEncoding in Encoding.GetEncodings())
    {
        // Decode those bytes with each encoding and compare with the expected text
        if (targetEncoding.GetEncoding().GetString(bytes) == destination)
        {
            Console.WriteLine("Source Encoding: {0} TargetEncoding: {1}", sourceEncoding.CodePage, targetEncoding.CodePage);
        }
    }
}

// Result1: Source Encoding: 1252 TargetEncoding: 1251
// Result2: Source Encoding: 28591 TargetEncoding: 1251
// Result3: Source Encoding: 28605 TargetEncoding: 1251

// The code for you to use 
var decodedCyrillic = Encoding.GetEncoding(1251).GetString(Encoding.GetEncoding(1252).GetBytes(source));
// Result: Работа в германии
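
If you need this in more than one place, the one-liner above could be wrapped in a small helper. This is only a sketch: the class and method names (MojibakeRepair, Cp1252ToCp1251) are made up for illustration, and it assumes the text really was CP1251 bytes decoded as CP1252, as the search above suggests. The RegisterProvider call is there because, on .NET Core / .NET 5+, the Windows code pages are not available until the code-pages provider is registered; on the classic .NET Framework they are built in.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class MojibakeRepair
{
    public static string Cp1252ToCp1251(string garbled)
    {
        // Needed on .NET Core / .NET 5+ so that code pages 1251 and 1252 resolve.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Ask for exception fallbacks so that a character which cannot be
        // encoded in code page 1252 raises an error instead of becoming '?'.
        Encoding latin = Encoding.GetEncoding(
            1252,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        Encoding cyrillic = Encoding.GetEncoding(1251);

        byte[] raw = latin.GetBytes(garbled);   // recover the original bytes
        return cyrillic.GetString(raw);         // decode them as Cyrillic
    }
}
```

Calling MojibakeRepair.Cp1252ToCp1251(drCurrent["myfield"].ToString()) would then replace the two conversion lines in the question. The exception fallback is a design choice: if a value is already proper Cyrillic, re-encoding it as 1252 should fail loudly rather than silently produce question marks like the ones in the question.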
M. Mennan Kara answered Oct 13 '22 13:10


ADO.NET exposes all string types from the SQL Server provider as C# strings, which implies that they have already been converted to Unicode. For non-Unicode source columns (as yours obviously is) like char(n) or varchar(n), the ADO.NET SQL Server provider uses the source collation information to determine the encoding. Therefore, if your non-Unicode SQL Server data gets represented in .NET with the wrong encoding, it must have been presented to the provider with the wrong collation. Choose an appropriate collation for your data and the ADO.NET provider for SQL Server will translate it using the appropriate encoding. For example, as documented in Collation and Code Page Architecture, Cyrillic collations will result in code page 1251, which is very likely what you want. The articles linked contain all the information you need to fix your problem.

using System;
using System.Text;
using System.Data.SqlClient;
using System.Windows.Forms;

public class Hello1
{
   public static void Main()
   {
    try
    {
        using (SqlConnection conn = new SqlConnection("server=.;integrated security=true"))
        {
            conn.Open ();

            // The .cs file must be saved as Unicode, obviously...
            //
            string s = "Работа в германии"; 

            byte[] b = Encoding.GetEncoding(1251).GetBytes (s);

            // Create a test table
            //
            SqlCommand cmd = new SqlCommand (
                @"create table #t (
                    c1 varchar(100) collate Latin1_General_CI_AS, 
                    c2 varchar(100) collate Cyrillic_General_CI_AS)", 
                conn);
            cmd.ExecuteNonQuery ();

            // Insert the same value twice, the original Unicode string
            // encoded as CP1251
            //
            cmd = new SqlCommand (
                @"insert into #t (c1, c2) values (@b, @b)", conn);
            cmd.Parameters.AddWithValue("@b", b);
            cmd.ExecuteNonQuery ();

            // Read the value as Latin collation 
            //
            cmd = new SqlCommand (
                @"select c1 from #t", conn);
            string def = (string) cmd.ExecuteScalar ();

            // Read the same value as Cyrillic collation
            //
            cmd = new SqlCommand (
                @"select c2 from #t", conn);
            string cyr = (string) cmd.ExecuteScalar ();

            // Cannot use Console.Write since the console is not Unicode
            //
            MessageBox.Show(String.Format(
                @"Original: {0}  Default collation: {1} Cyrillic collation: {2}", 
                    s, def, cyr));
        }

    }
    catch(Exception e)
    {
        Console.WriteLine (e);
    }   
   }
}

The result is:

---------------------------

---------------------------
Original: Работа в германии  Default collation: Ðàáîòà â ãåðìàíèè Cyrillic collation: Работа в германии
---------------------------
OK   
---------------------------
Remus Rusanu answered Oct 13 '22 12:10