How to Generate all the characters in the UTF-8 charset in .net

I have been given the task of generating all the characters in the UTF-8 character set to test how a system handles each of them. I do not have much experience with character encoding. The approach I was going to try was to increment a counter and then translate that base-ten number into its equivalent UTF-8 character, but so far I have not been able to find an effective way to do this in C# on .NET 3.5.

Any suggestions would be greatly appreciated.

asked Nov 03 '09 16:11

FireWire

3 Answers

string definedCodePoints;
using (System.Net.WebClient client = new System.Net.WebClient()) {
  definedCodePoints = client.DownloadString(
                         "http://unicode.org/Public/UNIDATA/UnicodeData.txt");
}
System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
using (System.IO.StringReader reader = new System.IO.StringReader(definedCodePoints)) {
  string line;
  while ((line = reader.ReadLine()) != null) {
    if (line.Length == 0) continue; // skip blank lines
    // The first semicolon-separated field is the code point in hexadecimal.
    int codePoint = Convert.ToInt32(line.Substring(0, line.IndexOf(';')), 16);
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
      // Surrogate range: listed in the document, but not valid code points.
      continue;
    }
    // Note: large blocks (e.g. CJK ideographs) are listed only as a
    // "<..., First>" / "<..., Last>" pair of lines, so this loop sees
    // just the two endpoints of each such range.
    string utf16 = char.ConvertFromUtf32(codePoint);
    byte[] utf8 = encoder.GetBytes(utf16);
    // TODO: something with the UTF-8-encoded character
  }
}

The above code should iterate over the currently assigned Unicode characters. You'll probably want to parse the UnicodeData file locally and fix any C# blunders I've made.

The set of currently assigned Unicode characters is less than the set that could be defined. Of course, whether you see a character when you print one of them out depends on a great many other factors, like fonts and the other applications it'll pass through before it is emitted to your eyeball.
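As a rough illustration of parsing the file locally, the sketch below reads lines in the UnicodeData.txt format and expands the "<..., First>" / "<..., Last>" range pairs into individual code points. The class and method names (UnicodeDataParser, AssignedCodePoints) are made up for this example, and the local file path in the usage note is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class UnicodeDataParser
{
    // Yields every code point described by lines in UnicodeData.txt format,
    // expanding "<..., First>" / "<..., Last>" pairs and skipping surrogates.
    public static IEnumerable<int> AssignedCodePoints(IEnumerable<string> lines)
    {
        int rangeStart = -1; // start of a pending First/Last range, or -1
        foreach (string line in lines)
        {
            if (string.IsNullOrEmpty(line)) continue;
            string[] fields = line.Split(';');
            if (fields.Length < 2) continue;

            int cp = int.Parse(fields[0], NumberStyles.HexNumber,
                               CultureInfo.InvariantCulture);
            string name = fields[1];

            if (name.EndsWith(", First>", StringComparison.Ordinal))
            {
                rangeStart = cp; // remember it and wait for the matching Last line
                continue;
            }

            int first = cp;
            if (rangeStart >= 0 && name.EndsWith(", Last>", StringComparison.Ordinal))
            {
                first = rangeStart;
            }
            rangeStart = -1;

            for (int c = first; c <= cp; c++)
            {
                if (c >= 0xD800 && c <= 0xDFFF) continue; // surrogates
                yield return c;
            }
        }
    }
}
```

With a downloaded copy of the file, this could be called as UnicodeDataParser.AssignedCodePoints(System.IO.File.ReadAllLines("UnicodeData.txt")), and each result passed to char.ConvertFromUtf32 as in the code above.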

answered Oct 22 '22 09:10

McDowell

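There are no "UTF-8 characters" as such. Do you mean Unicode characters, or the UTF-8 encoding of Unicode characters?

It's easy to convert an int to a Unicode character, provided of course that there is a mapping for that code and that it fits in a single UTF-16 char (code points up to U+FFFF; for higher ones, use char.ConvertFromUtf32 to get a string):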

char c = (char)theNumber;

If you want the UTF-8 encoding for that character, that's not very hard either:

byte[] encoded = Encoding.UTF8.GetBytes(c.ToString());

You would have to check the Unicode standard to see the number ranges where there are Unicode characters defined.
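If the goal is simply to walk every value that could be a character, one possible sketch is to loop over U+0000 through U+10FFFF and skip the surrogate range. The names AllScalarValues, Enumerate and ToUtf8 are invented for this example. Note that many of the values it produces are unassigned, so this is a superset of the defined characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AllScalarValues
{
    // Yields every Unicode scalar value: U+0000..U+10FFFF minus surrogates.
    public static IEnumerable<int> Enumerate()
    {
        for (int cp = 0; cp <= 0x10FFFF; cp++)
        {
            if (cp >= 0xD800 && cp <= 0xDFFF) continue; // surrogate range
            yield return cp;
        }
    }

    // Encodes one code point as UTF-8 by going through its UTF-16 string form.
    // char.ConvertFromUtf32 returns one char for U+0000..U+FFFF and a
    // surrogate pair (two chars) for anything above.
    public static byte[] ToUtf8(int codePoint)
    {
        string utf16 = char.ConvertFromUtf32(codePoint);
        return Encoding.UTF8.GetBytes(utf16);
    }
}
```

A caller would iterate AllScalarValues.Enumerate() and feed each value to ToUtf8, which returns between one and four bytes depending on the code point.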

answered Oct 22 '22 10:10

Guffa

Even once you generate all the characters, you'll find it's not an effective test. Some of the characters are combining marks, which attach to the character that precedes them, so a string full of combining marks won't make much sense. There are other special cases too. You'll be much better off using actual text in the languages you need to support.
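If generated characters are used anyway, one way to deal with combining marks is to detect them by general category and give them a base character to attach to. The sketch below does this with CharUnicodeInfo.GetUnicodeCategory. The class name, the method names, and the choice of U+25CC (dotted circle) as the base are assumptions made for illustration, not a recommendation from the answer above.

```csharp
using System;
using System.Globalization;

public static class CombiningMarks
{
    // True if the code point's general category is Mn, Mc or Me.
    public static bool IsCombiningMark(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(s, 0);
        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.EnclosingMark;
    }

    // Returns the character as a string; combining marks are placed after
    // a dotted circle (U+25CC) so that they have a base to attach to.
    public static string Displayable(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return IsCombiningMark(codePoint) ? "\u25CC" + s : s;
    }
}
```

The category data comes from the Unicode tables built into the runtime, so the result for recently added characters may vary between .NET versions.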

answered Oct 22 '22 09:10

Mark Ransom

Tags:

c#

.net

character-encoding

utf-8