I'm trying to output a Unicode string in RTF format (using C# and WinForms).
From Wikipedia:
If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beh, specifying that older programs which do not have Unicode support should render it as a question mark instead.
I don't know how to convert a Unicode character into a Unicode code point (the "\u1576" form). Conversion to UTF-8, UTF-16 and similar is easy, but I don't know how to convert to a code point.
Scenario in which I use this:
The problem arises when Unicode characters show up in the text.
Unicode RTF. Word 2000 is a Unicode-enabled application. Text is handled using the 16-bit Unicode character encoding scheme. Expressing this text in RTF requires a new mechanism, because until this release (version 1.6), RTF has only handled 7-bit characters directly and 8-bit characters encoded as hexadecimal.
It can represent all 1,114,112 Unicode characters. Most C code that deals with strings on a byte-by-byte basis still works, since UTF-8 is fully compatible with 7-bit ASCII.
Looks like RTF doesn't know UTF-8 at all, only Unicode in general. Other answers for Java and C# just use the \u directly.
Provided that all the characters that you're catering for exist in the Basic Multilingual Plane (it's unlikely that you'll need anything more), then a simple UTF-16 encoding should suffice.
Wikipedia:
All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.
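To make the BMP caveat concrete, here is a minimal sketch of how a C# string exposes UTF-16 code units and how those relate to code points. The class and method names (CodePointDemo, CodePointAt) are made up for illustration; char.ConvertToUtf32 and char.IsHighSurrogate are standard .NET methods.

```csharp
using System;

static class CodePointDemo
{
    // Returns the Unicode code point that starts at the given index,
    // combining a surrogate pair into a single value where necessary.
    public static int CodePointAt(string s, int index)
    {
        return char.ConvertToUtf32(s, index);
    }

    static void Main()
    {
        string bmp = "\u0628";         // Arabic letter beh, U+0628
        string astral = "\U0001F600";  // U+1F600, outside the BMP

        // A BMP character is one UTF-16 code unit whose value is the code point.
        Console.WriteLine(bmp.Length);       // 1
        Console.WriteLine((int)bmp[0]);      // 1576

        // A character outside the BMP is stored as two surrogate code units.
        Console.WriteLine(astral.Length);                     // 2
        Console.WriteLine(char.IsHighSurrogate(astral[0]));   // True
        Console.WriteLine(CodePointAt(astral, 0));            // 128512
    }
}
```

For BMP text, casting a char to int is therefore enough to get the number that goes after \u. Only characters stored as surrogate pairs need extra care.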
The following sample program illustrates doing something along the lines of what you want:
    static void Main(string[] args)
    {
        // ë
        char[] ca = Encoding.Unicode.GetChars(new byte[] { 0xeb, 0x00 });
        var sw = new StreamWriter(@"c:/helloworld.rtf");
        sw.WriteLine(@"{\rtf {\fonttbl {\f0 Times New Roman;}} \f0\fs60 H" + GetRtfUnicodeEscapedString(new String(ca)) + @"llo, World! }");
        sw.Close();
    }

    static string GetRtfUnicodeEscapedString(string s)
    {
        var sb = new StringBuilder();
        foreach (var c in s)
        {
            if (c <= 0x7f)
                sb.Append(c);
            else
                sb.Append("\\u" + Convert.ToUInt32(c) + "?");
        }
        return sb.ToString();
    }
The important bit is the Convert.ToUInt32(c) call, which returns the numeric value of the UTF-16 code unit; for any character in the Basic Multilingual Plane this is the same as its code point. The RTF escape for Unicode requires that value in decimal. The System.Text.Encoding.Unicode encoding corresponds to UTF-16 as per the MSDN documentation.
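One detail worth noting: the Wikipedia text quoted in the question describes the \u argument as a 16-bit signed decimal integer, so code units above 32767 would be written as negative numbers. The following is a sketch of that variant, not a tested drop-in replacement; the class and method names (RtfSignedEscape, Escape) are hypothetical, and escaping of \, { and } is left out to keep it short.

```csharp
using System;
using System.Globalization;
using System.Text;

static class RtfSignedEscape
{
    // Escapes non-ASCII characters as \uN?, where N is the UTF-16 code unit
    // reinterpreted as a signed 16-bit integer (values above 32767 go negative).
    public static string Escape(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (c <= 0x7f)
            {
                sb.Append(c);
            }
            else
            {
                // unchecked: wrap around instead of throwing in a checked context.
                short codeUnit = unchecked((short)c);
                sb.Append("\\u")
                  .Append(codeUnit.ToString(CultureInfo.InvariantCulture))
                  .Append('?');
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Escape("\u00EB"));  // \u235?
        Console.WriteLine(Escape("\uFFFD"));  // \u-3?
    }
}
```

Because the loop works on code units, a character outside the BMP comes out as two escapes, one per surrogate. Whether a particular RTF reader accepts that form, or also tolerates unsigned values above 32767, is something to check against the reader you are targeting.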
Fixed code from the accepted answer, with escaping added for the special characters \, { and }, as described in this link:
    static string GetRtfUnicodeEscapedString(string s)
    {
        var sb = new StringBuilder();
        foreach (var c in s)
        {
            if (c == '\\' || c == '{' || c == '}')
                sb.Append(@"\" + c);
            else if (c <= 0x7f)
                sb.Append(c);
            else
                sb.Append("\\u" + Convert.ToUInt32(c) + "?");
        }
        return sb.ToString();
    }