
List of all characters that String.Trim() removes?

Tags:

c#

.net

The primary use of Trim is to remove leading and trailing whitespace from a string, for example:

"  hello  ".Trim(); // results in "hello"

But Trim also removes other characters such as \n, \r and \t, so:

"  \nhello\r\t  ".Trim(); // it also produces "hello"

Is there a definitive list of all characters (preferably in escaped string format, like \n) that Trim will remove?

EDIT: Thanks for the detailed answers - I now know the EXACT chars. The Wikipedia list that @RayKoopa left in the comments is probably the best-looking format for me.

Asked by nikib3ro, Dec 01 '22 13:12


1 Answer

We can take a look at the .NET Framework reference source for the String class.

The public Trim() method calls an internal helper method named TrimHelper():

public String Trim() {
    Contract.Ensures(Contract.Result<String>() != null);
    Contract.EndContractBlock();

    return TrimHelper(TrimBoth);
}

TrimHelper() looks like this:

        [System.Security.SecuritySafeCritical]  // auto-generated
        private String TrimHelper(int trimType) {
            //end will point to the first non-trimmed character on the right
            //start will point to the first non-trimmed character on the Left
            int end = this.Length-1;
            int start=0;

            //Trim specified characters.
            if (trimType !=TrimTail)  {
                for (start=0; start < this.Length; start++) {
                    if (!Char.IsWhiteSpace(this[start]) && !IsBOMWhitespace(this[start])) break;
                }
            }

            if (trimType !=TrimHead) {
                for (end= Length -1; end >= start;  end--) {
                    if (!Char.IsWhiteSpace(this[end])  && !IsBOMWhitespace(this[start])) break;
                }
            }

            return CreateTrimmedString(start, end);
        }

So the answer to your question mostly comes down to which characters Char.IsWhiteSpace returns true for. From char.cs:

    [Pure]
    public static bool IsWhiteSpace(char c) {

        if (IsLatin1(c)) {
            return (IsWhiteSpaceLatin1(c));
        }
        return CharUnicodeInfo.IsWhiteSpace(c);
    }

If it's a Latin-1 character (U+0000 through U+00FF), then this is what counts as white space:

        private static bool IsWhiteSpaceLatin1(char c) {

            // There are characters which belong to UnicodeCategory.Control but are considered as white spaces.
            // We use code point comparisons for these characters here as a temporary fix.

            // U+0009 = <control> HORIZONTAL TAB
            // U+000a = <control> LINE FEED
            // U+000b = <control> VERTICAL TAB
            // U+000c = <contorl> FORM FEED
            // U+000d = <control> CARRIAGE RETURN
            // U+0085 = <control> NEXT LINE
            // U+00a0 = NO-BREAK SPACE
            if ((c == ' ') || (c >= '\x0009' && c <= '\x000d') || c == '\x00a0' || c == '\x0085') {
                return (true);
            }
            return (false);
        }
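
To see that list in escaped form, here is a minimal sketch that walks the Latin-1 range and prints every code point for which char.IsWhiteSpace returns true. The class and method names (Latin1WhitespaceDemo, Latin1Whitespace) are made up for illustration and are not part of the framework. The output should match the condition in IsWhiteSpaceLatin1 above: \u0009 through \u000D, \u0020, \u0085 and \u00A0.

```csharp
using System;
using System.Collections.Generic;

public static class Latin1WhitespaceDemo
{
    // Hypothetical helper: returns the code points in U+0000..U+00FF
    // for which char.IsWhiteSpace reports true.
    public static List<int> Latin1Whitespace()
    {
        var result = new List<int>();
        for (int cp = 0; cp <= 0xFF; cp++)
        {
            if (char.IsWhiteSpace((char)cp))
            {
                result.Add(cp);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Print each match as a C# escape sequence, e.g. \u0009.
        foreach (int cp in Latin1Whitespace())
        {
            Console.WriteLine("\\u" + cp.ToString("X4"));
        }
    }
}
```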

Otherwise we go to CharUnicodeInfo.cs, which looks up the character's UnicodeCategory and treats three categories as white space:

        internal static bool IsWhiteSpace(char c)
        {
            UnicodeCategory uc = GetUnicodeCategory(c);
            // In Unicode 3.0, U+2028 is the only character which is under the category "LineSeparator".
            // And U+2029 is th eonly character which is under the category "ParagraphSeparator".
            switch (uc) {
                case (UnicodeCategory.SpaceSeparator):
                case (UnicodeCategory.LineSeparator):
                case (UnicodeCategory.ParagraphSeparator):
                    return (true);
            }

            return (false);
        }
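
If you want the complete list for the runtime you are actually using, a rough sketch like the following enumerates every char value, keeps the ones char.IsWhiteSpace accepts, and confirms that the parameterless Trim() strips each of them. The names TrimCharactersDemo, WhitespaceChars and IsTrimmed are hypothetical helpers, not framework APIs. The exact set outside Latin-1 depends on the Unicode data your runtime ships with, so treat the printed list as specific to that runtime.

```csharp
using System;
using System.Collections.Generic;

public static class TrimCharactersDemo
{
    // Hypothetical helper: collects every UTF-16 code unit that
    // char.IsWhiteSpace reports as white space on the current runtime.
    public static List<char> WhitespaceChars()
    {
        var result = new List<char>();
        for (int cp = char.MinValue; cp <= char.MaxValue; cp++)
        {
            char c = (char)cp;
            if (char.IsWhiteSpace(c))
            {
                result.Add(c);
            }
        }
        return result;
    }

    // Hypothetical helper: true if Trim() strips c from both ends of a string.
    public static bool IsTrimmed(char c)
    {
        return (c + "hello" + c).Trim() == "hello";
    }

    public static void Main()
    {
        // Print each character as an escape sequence plus the Trim() check.
        foreach (char c in WhitespaceChars())
        {
            Console.WriteLine("\\u{0:X4} trimmed: {1}", (int)c, IsTrimmed(c));
        }
    }
}
```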

Answered by TEK, Dec 06 '22 17:12