
How to get correctly-encoded HTML from the clipboard?

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?

For example, executing a command like this:

string s = (string) Clipboard.GetData(DataFormats.Html);

Results in stuff like:

<FONT size=-2>Â Â <A href="/advanced_search?hl=en">Advanced 
Search</A><BR>Â Â <A href="/preferences?hl=en">Preferences</A><BR>Â Â <A 
href="/language_tools?hl=en">Language 
Tools</A></FONT>

Not sure how Markdown will process this, but there are weird characters in the resulting markup above.

It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?

Winston Fassett asked Oct 27 '08 01:10


1 Answer

The problem is less visible in your case than it was in mine. Today I tried to read data from the clipboard that contained a few Unicode characters. The text I got back looked as if a UTF-8 encoded file had been read as Windows-1250 (the local encoding on my Windows installation).

Your case seems to be the same. If you save the HTML data in Windows-1252 (or Windows-1250; both work), remembering to put a non-breaking space (0xA0), not a standard space, after each Â character, and then open the file as UTF-8, you will see what should have been there.

For another project I wrote a function that fixes data with a corrupted encoding.

In this case a simple conversion should be sufficient:

byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
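
To make the round trip concrete, here is a minimal, self-contained sketch that first reproduces the corruption and then reverses it. The class and method names (`MojibakeDemo`, `Corrupt`, `Repair`) are made up for illustration. It also assumes ISO-8859-1 as a stand-in for `Encoding.Default`: ISO-8859-1 agrees with Windows-1252 for bytes 0xA0 to 0xFF, which covers the non-breaking space case here, and it is available on every .NET runtime, whereas `Encoding.Default` depends on the machine and framework.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Assumption: ISO-8859-1 stands in for the local ANSI code page.
    // It maps every byte 0x00-0xFF to the code point of the same value,
    // so converting string -> bytes -> string loses nothing.
    private static readonly Encoding Legacy = Encoding.GetEncoding("ISO-8859-1");

    // Simulates the bug: UTF-8 bytes decoded with a single-byte encoding.
    public static string Corrupt(string text)
    {
        return Legacy.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string text)
    {
        return Encoding.UTF8.GetString(Legacy.GetBytes(text));
    }

    public static void Main()
    {
        // Two non-breaking spaces (U+00A0) followed by ordinary markup.
        string original = "\u00A0\u00A0<A href=\"/preferences?hl=en\">Preferences</A>";
        string corrupted = Corrupt(original);

        // Each U+00A0 becomes the pair U+00C2 U+00A0, i.e. "Â" plus a non-breaking space.
        Console.WriteLine(corrupted.StartsWith("\u00C2\u00A0\u00C2\u00A0", StringComparison.Ordinal)); // True
        Console.WriteLine(Repair(corrupted) == original); // True
    }
}
```

For characters that UTF-8 encodes with bytes in the 0x80 to 0x9F range, ISO-8859-1 and Windows-1252 differ, so real data would need the encoding that actually caused the corruption.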

My original function is a little bit more complex and contains tests to ensure that data are not corrupted...

public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
  if (string.IsNullOrEmpty(text))
    return false;
  byte[] data = encoding.GetBytes(text);
  // there should not be any character outside source encoding
  string newStr = encoding.GetString(data);
  if (!string.Equals(text, newStr)) // if there is any character "outside"
    return false; // leave, the input is in a different encoding
  if (IsValidUtf8(data) == 0) // test data to be valid UTF-8 byte sequence
    return false; // if not, can not convert to UTF-8
  text = Encoding.UTF8.GetString(data);
  return true;
}
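
The `IsValidUtf8` helper called above is not shown in this answer. As a rough stand-in, a validity check could be sketched with a strict `UTF8Encoding` that throws on invalid byte sequences. The names `Utf8Check` and `LooksLikeValidUtf8` are hypothetical, and this version returns a `bool` rather than the numeric result the original helper appears to return.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Strict UTF-8: no byte order mark, and throwOnInvalidBytes = true
    // so malformed sequences raise DecoderFallbackException.
    private static readonly UTF8Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Hypothetical replacement for IsValidUtf8; not the author's implementation.
    public static bool LooksLikeValidUtf8(byte[] data)
    {
        try
        {
            StrictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

With this check, the bytes `0xC2 0xA0` (a UTF-8 non-breaking space) pass, while `0xC2 0x20` (the lead byte followed by an ordinary space) fails, which is why the non-breaking space has to be preserved for the repair to work.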

I know this is not the best (or the correct) solution, but I did not find any other way to fix the input...

EDIT: (July 20, 2017)

It seems that Microsoft has since found this error, and it now works correctly. I'm not sure whether the problem exists only in some framework versions, but I do know that the application now targets a different framework than it did when I wrote this answer (now 4.5; previously 2.0). As a result, all my code now fails when parsing the data. That raises another problem: determining the correct behaviour for applications running with the fix already applied and without it.

Julo answered Jan 02 '23 03:01