
System.Uri drops Unicode RLM (Right-to-Left Mark; U+200F) character in .NET 4.5+

Tags: c#, uri, unicode

using System;

namespace UnicodeRlm
{
    class Program
    {
        static void Main(string[] args)
        {
            var uri = new Uri(
                "https://example.com/attachments/The title is \"مفتاح معايير الويب!‏\" in Arabic.pdf");
            Console.WriteLine(uri.AbsolutePath);
            Console.WriteLine(uri.AbsolutePath.Length);
        }
    }
}

Under .NET 4.0, this produces

/attachments/The%20title%20is%20%22%D9%85%D9%81%D8%AA%D8%A7%D8%AD%20%D9%85%D8%B9%D8%A7%D9%8A%D9%8A%D8%B1%20%D8%A7%D9%84%D9%88%D9%8A%D8%A8!%E2%80%8F%22%20in%20Arabic.pdf
168

Under .NET 4.5+, this produces

/attachments/The%20title%20is%20%22%D9%85%D9%81%D8%AA%D8%A7%D8%AD%20%D9%85%D8%B9%D8%A7%D9%8A%D9%8A%D8%B1%20%D8%A7%D9%84%D9%88%D9%8A%D8%A8!%22%20in%20Arabic.pdf
159

.NET 4.5 drops the %E2%80%8F part, which is the RLM character:

...!%E2%80%8F%22%20in%20Arabic.pdf
...!%22%20in%20Arabic.pdf
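To double-check that the missing segment really is the RLM character, here is a minimal sketch that percent-encodes the UTF-8 bytes of U+200F by hand. The helper name `PercentEncodeUtf8` is made up for illustration; it deliberately avoids `System.Uri` so the result does not depend on the framework version.

```csharp
using System;
using System.Text;

static class RlmEncoding
{
    // Hypothetical helper: percent-encodes every UTF-8 byte of the input.
    // It does not go through System.Uri at all.
    public static string PercentEncodeUtf8(string text)
    {
        var sb = new StringBuilder();
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            sb.Append('%').Append(b.ToString("X2"));
        }
        return sb.ToString();
    }

    static void Main()
    {
        string encoded = PercentEncodeUtf8("\u200F"); // U+200F RIGHT-TO-LEFT MARK
        Console.WriteLine(encoded);        // %E2%80%8F
        Console.WriteLine(encoded.Length); // 9, the difference between 168 and 159
    }
}
```

The nine characters of `%E2%80%8F` account exactly for the difference between the two reported lengths, so nothing else appears to be dropped.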

My hypothesis is that this happens because System.Uri escaping now supports RFC 3986, but my RFC-fu and Unicode-fu are failing me: I cannot tell whether this RFC requires the RLM to be dropped, or whether the RLM character is placed correctly at all in the original string.

I'm not entirely sure whether this is the correct behavior standards-wise, but for me it certainly is not, since under .NET 4.5 I cannot download a file with an RLM character in its name using either WebClient or HttpWebRequest.

Is there any way to work around this quirk?

Anton Gogolev asked Jan 20 '21 08:01




1 Answer

In .NET 4.5, Internationalized Resource Identifier (IRI) support was enabled by default. When targeting .NET 4.7.2, the right-to-left mark seems to be honored again, which could indicate that the earlier behavior was a bug.

If the project needs to target .NET 4.5, the method ToggleIDNIRISupport in this post can help to overcome the issue.

Call the method like this:

ToggleIDNIRISupport(false);

A URI constructed after this method call retains the right-to-left mark.
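For readers who cannot reach the linked post, here is a rough sketch of what such a toggle might look like. This is not the code from that post. It assumes that System.Uri in the .NET Framework keeps its IRI/IDN settings in private static fields named `s_IriParsing` and `s_IdnScope`; those names are implementation details and may not exist on other runtimes, so the hypothetical helper `TryDisableIriParsing` simply returns false when it cannot find them.

```csharp
using System;
using System.Reflection;

static class UriIriWorkaround
{
    // ASSUMPTION: "s_IriParsing" (bool) and "s_IdnScope" (enum) are private static
    // fields of System.Uri in the .NET Framework. They are undocumented internals.
    // Returns false without changing anything if the fields are not found.
    public static bool TryDisableIriParsing()
    {
        // Touch System.Uri first so that, presumably, its own lazy configuration
        // runs before the fields are overwritten.
        Uri.IsWellFormedUriString("http://example.com/", UriKind.Absolute);

        const BindingFlags flags = BindingFlags.Static | BindingFlags.NonPublic;
        FieldInfo iriParsing = typeof(Uri).GetField("s_IriParsing", flags);
        FieldInfo idnScope = typeof(Uri).GetField("s_IdnScope", flags);

        if (iriParsing == null || iriParsing.FieldType != typeof(bool) ||
            idnScope == null || !idnScope.FieldType.IsEnum)
        {
            return false; // internals differ on this runtime; nothing was changed
        }

        iriParsing.SetValue(null, false);
        // 0 is the value of UriIdnScope.None
        idnScope.SetValue(null, Enum.ToObject(idnScope.FieldType, 0));
        return true;
    }

    static void Main()
    {
        bool applied = TryDisableIriParsing();
        Console.WriteLine("Workaround applied: " + applied);

        var uri = new Uri("https://example.com/attachments/a\u200Fb.pdf");
        Console.WriteLine(uri.AbsolutePath);
    }
}
```

Because this relies on reflection over private fields, it changes URI parsing for the whole process and could break with any framework update, so it is best treated as a last resort and called once at startup.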

alex-dl answered Oct 24 '22 20:10
