How to generate all the characters in the UTF-8 charset in .NET

I have been given the task of generating all the characters in the UTF-8 character set to test how a system handles each of them. I do not have much experience with character encoding. The approach I was going to try was to increment a counter and then translate that base-ten number into its equivalent UTF-8 character, but so far I have not been able to find an effective way to do this in C# (.NET 3.5).

Any suggestions would be greatly appreciated.

asked Nov 03 '09 16:11 by FireWire


3 Answers

// Requires: using System; using System.IO; using System.Net; using System.Text;
string definedCodePoints;
using (WebClient client = new WebClient()) {
  definedCodePoints = client.DownloadString(
                         "http://unicode.org/Public/UNIDATA/UnicodeData.txt");
}
UTF8Encoding encoder = new UTF8Encoding();
using (StringReader reader = new StringReader(definedCodePoints)) {
  string line;
  while ((line = reader.ReadLine()) != null) {
    int separator = line.IndexOf(';');
    if (separator <= 0) continue; // skip blank or malformed lines
    // The first field of each line is the code point in hexadecimal.
    int codePoint = Convert.ToInt32(line.Substring(0, separator), 16);
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
      // Surrogate range: not valid code points on their own, but listed in the file.
      continue;
    }
    string utf16 = char.ConvertFromUtf32(codePoint);
    byte[] utf8 = encoder.GetBytes(utf16);
    // TODO: do something with the UTF-8-encoded character
  }
}

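The above code should iterate over the currently assigned Unicode characters that UnicodeData.txt lists individually. Note that the file records some large blocks (such as the CJK ideographs and Hangul syllables) only as a pair of "First" and "Last" entries, so this loop sees just the two endpoints of each such range. You'll probably want to parse the UnicodeData file locally and fix any C# blunders I've made.

The set of currently assigned Unicode characters is smaller than the set that could be defined. Of course, whether you see a character when you print one of them out depends on a great many other factors, like fonts and the other applications it'll pass through before it is emitted to your eyeball.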
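UnicodeData.txt describes some large blocks with two lines whose names end in ", First>" and ", Last>" instead of one line per character. As a minimal sketch of how those ranges could be expanded, assuming the semicolon-separated layout with the code point in the first field and the name in the second, something like the following might work. The class and method names are hypothetical, and private-use ranges are returned along with everything else.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; the names here are illustrative only.
public static class UnicodeDataParser
{
    // Yields every code point described by UnicodeData.txt-style lines,
    // expanding "<..., First>" / "<..., Last>" pairs and skipping surrogates.
    public static IEnumerable<int> AssignedCodePoints(TextReader reader)
    {
        int rangeStart = -1;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split(';');
            if (fields.Length < 2 || fields[0].Length == 0)
            {
                continue; // blank or malformed line
            }
            int codePoint = Convert.ToInt32(fields[0], 16);
            string name = fields[1];
            if (name.EndsWith(", First>", StringComparison.Ordinal))
            {
                rangeStart = codePoint; // remember the start, wait for "Last"
                continue;
            }
            int first = codePoint;
            if (name.EndsWith(", Last>", StringComparison.Ordinal) && rangeStart >= 0)
            {
                first = rangeStart;
                rangeStart = -1;
            }
            for (int cp = first; cp <= codePoint; cp++)
            {
                if (cp >= 0xD800 && cp <= 0xDFFF)
                {
                    continue; // surrogates cannot be encoded on their own
                }
                yield return cp;
            }
        }
    }
}
```

A caller could pass a StreamReader over a locally saved copy of the file and feed each returned value to char.ConvertFromUtf32 followed by Encoding.UTF8.GetBytes.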

answered Oct 22 '22 09:10 by McDowell


There are no "UTF-8 characters". Do you mean Unicode characters, or the UTF-8 encoding of Unicode characters?

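It's easy to convert an int to a Unicode character, provided of course that there is a mapping for that code and that it fits in a single 16-bit char (that is, it is at most 0xFFFF and not a surrogate):

char c = (char)theNumber;

For code points above 0xFFFF, use char.ConvertFromUtf32(theNumber) instead, which returns a string containing a surrogate pair.

If you want the UTF-8 encoding for that character, that's not very hard either:

byte[] encoded = Encoding.UTF8.GetBytes(c.ToString());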

You would have to check the Unicode standard to see the number ranges where there are Unicode characters defined.
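The counter approach from the question can be sketched roughly as follows. This is only an illustration: it walks every value from U+0000 to U+10FFFF and skips the surrogate range, so it covers everything UTF-8 can encode, including code points that are unassigned or reserved as noncharacters. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; the names here are illustrative only.
public static class ScalarValues
{
    // Yields every Unicode scalar value: U+0000..U+10FFFF minus the surrogates.
    public static IEnumerable<int> All()
    {
        for (int codePoint = 0; codePoint <= 0x10FFFF; codePoint++)
        {
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            {
                continue; // surrogates cannot be encoded on their own
            }
            yield return codePoint;
        }
    }

    // Encodes a single scalar value as UTF-8 (one to four bytes).
    public static byte[] ToUtf8(int codePoint)
    {
        // ConvertFromUtf32 returns one char for code points up to 0xFFFF
        // and a surrogate pair for anything above.
        string utf16 = char.ConvertFromUtf32(codePoint);
        return Encoding.UTF8.GetBytes(utf16);
    }
}
```

Looping over ScalarValues.All() and calling ScalarValues.ToUtf8 on each value would give the byte sequence for every encodable code point. Restricting the output to assigned characters would still need a data source such as the Unicode Character Database.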

answered Oct 22 '22 10:10 by Guffa


Even once you generate all the characters, you'll find it's not an effective test. Some of the characters are combining marks, which means they combine with the character that precedes them, so a string full of combining marks won't make much sense. There are other special cases too. You'll be much better off using actual text in the languages you need to support.
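To make the combining-mark point concrete, here is a small sketch of how generated code points could be checked before they go into test data. It relies on the Unicode category tables built into .NET, which may lag behind the current Unicode version, and the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; the names here are illustrative only.
public static class CombiningMarks
{
    // Returns true when the code point is in one of the three mark categories.
    // The caller must not pass a surrogate (ConvertFromUtf32 would throw).
    public static bool IsCombiningMark(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, 0);
        return category == UnicodeCategory.NonSpacingMark
            || category == UnicodeCategory.SpacingCombiningMark
            || category == UnicodeCategory.EnclosingMark;
    }

    // Counts text elements, where a base character plus its combining marks
    // is treated as a single element.
    public static int TextElementCount(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }
}
```

For example, the string "e\u0301" (the letter e followed by U+0301 COMBINING ACUTE ACCENT) holds two char values, yet it should be counted as a single text element, which is why a test string built from isolated marks behaves differently from real text.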

answered Oct 22 '22 09:10 by Mark Ransom