
Convert GB2312 to UTF-8

I have a text file of localized language strings that is currently encoded in GB2312 (Simplified Chinese), while all of my other language files are in UTF-8. I am finding this file very difficult to work with: none of my text editors handle it properly, and they keep corrupting it. Are there any tools to convert it to UTF-8, and are there any downsides to doing so? Or would it be better to keep it as GB2312 and use a different editor (if so, can you recommend one)?

Update: I'm using Windows XP (English install).

Update #2: I've tried using Notepad++ and Notepad2 to edit the GB2312 files, but both are unable to read the files and corrupt them.

Jon Tackabury asked Dec 18 '08 20:12


2 Answers

You can try this online service that uses the open-source iconv utility.
You can also install Charco, a command-line version of it, on your machine.

For GB2312, you can use CP936 as the encoding.
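
As a rough sketch of the command-line route, assuming a Unix-like shell (Cygwin or similar on Windows) and an iconv build that recognizes the name CP936 (`iconv -l` lists the names a given build accepts), the conversion could look like this. The file names are made up for illustration, and the first line only creates a tiny sample file to convert:

```shell
# Create a small GB2312 sample: the bytes C4 E3 BA C3 spell "你好".
printf '\304\343\272\303\n' > strings_gb2312.txt

# Convert from CP936 to UTF-8; -f names the source encoding, -t the target.
# The original file is left untouched and the result goes to a new file.
iconv -f CP936 -t UTF-8 strings_gb2312.txt > strings_utf8.txt

# Show the converted text.
cat strings_utf8.txt
```

Writing to a separate output file keeps the GB2312 original around, so you can compare the two before replacing anything.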

If you are a .NET developer, you can write a small tool that does just that.
I've struggled with this as well and found that it was actually simple to solve from a programmatic point of view.

All you need is something like this (I tested it and it works):

In C#

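using System.IO;
using System.Text;

class Program {
    static void Main(string[] args) {
        string infile = args[0];   // path of the GB2312 source file
        string outfile = args[1];  // path of the UTF-8 file to create

        // Code page 936 is the Windows code page that covers GB2312.
        using (StreamReader sr = new StreamReader(infile, Encoding.GetEncoding(936)))
        // Encoding.UTF8 writes a byte order mark at the start of the file.
        using (StreamWriter sw = new StreamWriter(outfile, false, Encoding.UTF8)) {
            sw.Write(sr.ReadToEnd());
        }
        // Both streams are closed when the using blocks exit.
    }
}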

In VB.NET

Imports System.IO
Imports System.Text

Module Program
    Sub Main(ByVal args() As String)
        Dim infile As String = args(0)
        Dim outfile As String = args(1)

        ' Code page 936 is the Windows code page that covers GB2312.
        Using sr As New StreamReader(infile, Encoding.GetEncoding(936))
            Using sw As New StreamWriter(outfile, False, Encoding.UTF8)
                sw.Write(sr.ReadToEnd())
            End Using
        End Using
    End Sub
End Module

Renaud Bompuis answered Sep 27 '22 16:09


I might be oversimplifying here, but if it's just this one plain text file, you could try the following:

  1. Replace all & by &amp;, all < by &lt; and all > by &gt; (to be on the safe side)
  2. Prepend the following to the text file:

    <html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head><body><pre>

  3. Open the file in your favorite browser

  4. Select and copy all text
  5. Paste it in Notepad and save as UTF-8.
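
If the file is too large to edit by hand, steps 1 and 2 could be scripted. This is only a sketch, assuming a Unix-like shell with sed available (Cygwin or similar on Windows); the file names are hypothetical, and the first line just creates a small sample to work on:

```shell
# Sample input: GB2312 bytes for "你好" followed by some ASCII markup characters.
printf '\304\343\272\303 <a & b>\n' > strings_gb2312.txt

{
  # Step 2: the header that tells the browser the page is GB2312.
  printf '%s\n' '<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head><body><pre>'
  # Step 1: escape & first, then < and >, so the entities are not escaped twice.
  # LC_ALL=C makes sed treat the input as raw bytes rather than decoding it.
  LC_ALL=C sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' strings_gb2312.txt
} > strings_preview.html
```

The byte-wise replacement does not damage the Chinese text, because the bytes for &, < and > (0x26, 0x3C, 0x3E) never occur inside a two-byte GB2312 character. You would then open strings_preview.html in the browser and continue from step 3.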

You'd be done with this before you could have written any code to do the conversion or downloaded any programs that would do the conversion for you.

Of course, I'm not a hundred percent sure this'll work, and your browser would need the correct fonts and everything, but considering you're working with these kinds of files I'm assuming you already have those.

mercator answered Sep 27 '22 16:09