Match.Value and international characters

UPDATE: May this post be helpful for coders using RichTextBoxes. The Match is correct for a normal string. What I did not notice is that "ä" is turned into "\'e4" in richTextBox.Rtf! So the Match.Value is correct - human error.

A Regex finds the correct text, but Match.Value is wrong because the German "ä" is replaced with "\'e4"!

Let example_text = "<em>Primär-ABC</em>" and let's use the following code:

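string example_text = "<em>Primär-ABC</em>";
Regex em = new Regex(@"<em>[^<]*</em>");
// Variant 1: matching against the plain string works.
// Match emMatch = em.Match(example_text);
// Variant 2: matching against the RTF markup fails (returns \'e4 instead of ä).
Match emMatch = em.Match(richTextBox.Rtf);
while (emMatch.Success)
{
    string matchValue = emMatch.Value;
    Foo(matchValue);
    // Advance to the next match; without this the loop never terminates.
    emMatch = emMatch.NextMatch();
}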

Then emMatch.Value returns "<em>Prim\'e4r-ABC</em>" instead of "<em>Primär-ABC</em>".

The German ä is turned into \'e4! Because I want to work with the exact string, I need emMatch.Value to contain Primär-ABC. How do I achieve that?

user1338270 asked Jul 27 '12 08:07

1 Answer

In what context are you doing this?

string example_text = "<em>Ich bin ein Bärliner</em>";
Regex em = new Regex(@"<em>[^<]*</em>" );
Match emMatch = em.Match(example_text);
while (emMatch.Success)
{
    Console.WriteLine(emMatch.Value);
    emMatch = emMatch.NextMatch();
}

This outputs <em>Ich bin ein Bärliner</em> in my console.

The problem probably isn't that you're getting the wrong value back, but that you're looking at a representation of the value that isn't displayed correctly. This can depend on a lot of things. Try writing the value to a text file using UTF-8 encoding and see whether it is still incorrect.
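A minimal sketch of that check, assuming a console application; the class name Utf8Check, the helper RoundTrip and the file name match-value.txt are made up for illustration. It writes the matched value to a UTF-8 file and reads it back, so you can compare the two strings or open the file in an editor.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class Utf8Check
{
    // Hypothetical helper: writes the value as UTF-8 and reads it back.
    public static string RoundTrip(string value, string path)
    {
        File.WriteAllText(path, value, Encoding.UTF8);
        return File.ReadAllText(path, Encoding.UTF8);
    }

    static void Main()
    {
        Match m = Regex.Match("<em>Primär-ABC</em>", @"<em>[^<]*</em>");

        // Hypothetical file name in the temp directory.
        string path = Path.Combine(Path.GetTempPath(), "match-value.txt");
        string fromFile = RoundTrip(m.Value, path);

        // If the value itself is intact, both strings are identical.
        Console.WriteLine(fromFile == m.Value);
    }
}
```

If the comparison holds and the file shows the ä correctly in an editor, the match value is fine and the issue lies in where the input string comes from or how it is displayed.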

Edit: Right. The thing is that you are getting the text from a WinForms RichTextBox using the Rtf property. This does not return the text as is; it returns the RTF representation of the text. RTF is not plain text, it's a markup format for displaying rich text. If you open an RTF document in e.g. Notepad you will see that it has a lot of weird codes in it, including \'e4 for every 'ä' in your RTF document. If you had used some formatting (like bold text, color etc.) in the RichTextBox, the .Rtf property would return that markup as well, looking something like {\rtlch\fcs1 \af31507 \ltrch\fcs0 \cf6\insrsid15946317\charrsid15946317 test}
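To make the \'e4 escape concrete: the two hex digits are a byte value, and 0xE4 is the code for ä. The sketch below only illustrates that mapping and is not an RTF parser. The names RtfEscapeDemo and DecodeHexEscapes are made up, and it assumes each escaped byte corresponds directly to the Unicode code point with the same number, which holds for ä but not for every byte in every code page.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class RtfEscapeDemo
{
    // Hypothetical helper: replaces RTF hex escapes of the form \'xx
    // with the character whose code point is 0xXX.
    // Assumption: byte value == Unicode code point (true for 0xE4 = ä).
    public static string DecodeHexEscapes(string rtfFragment)
    {
        return Regex.Replace(
            rtfFragment,
            @"\\'([0-9a-fA-F]{2})",
            m => ((char)byte.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }

    static void Main()
    {
        // The fragment from the question, as it appears inside the RTF markup.
        Console.WriteLine(DecodeHexEscapes(@"<em>Prim\'e4r-ABC</em>"));
    }
}
```

Real RTF also contains control words, groups and other escape forms, so decoding it by hand like this is not a practical route for application code.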

So use the .Text property instead. It will return the actual plain text.
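Applied to the code from the question, that could look roughly like the sketch below. It assumes a WinForms project (a reference to System.Windows.Forms); the class name TextPropertyDemo and the helper FindEmphasis are made up, and the RichTextBox is created in code only to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Windows.Forms;

static class TextPropertyDemo
{
    // Hypothetical helper: returns every <em>...</em> match in the given text.
    public static List<string> FindEmphasis(string text)
    {
        var results = new List<string>();
        var em = new Regex(@"<em>[^<]*</em>");
        for (Match m = em.Match(text); m.Success; m = m.NextMatch())
        {
            results.Add(m.Value);
        }
        return results;
    }

    [STAThread]
    static void Main()
    {
        using (var richTextBox = new RichTextBox())
        {
            richTextBox.Text = "<em>Primär-ABC</em>";

            // Match against the plain text, not against richTextBox.Rtf.
            foreach (string value in FindEmphasis(richTextBox.Text))
            {
                Console.WriteLine(value);
            }
        }
    }
}
```

The regex and the loop stay the same as in the question; only the input string changes from the RTF markup to the plain text.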

Anders Arpi answered Oct 21 '22 19:10