The primary use of Trim is, obviously, to remove leading and trailing whitespace from a string, like:
" hello ".Trim(); // results in "hello"
But Trim also removes other whitespace characters such as \n, \r and \t, so:
" \nhello\r\t ".Trim(); // it also produces "hello"
Is there a definitive list of all characters (preferably in escaped string format, like \n) that Trim will remove?
EDIT: Thanks for the detailed answers - I now know the EXACT chars. The Wikipedia list that @RayKoopa left in the comments is probably the best-looking format for me.
We can take a look at the source code for the String class. The public Trim() method calls an internal helper method named TrimHelper():
public String Trim() {
    Contract.Ensures(Contract.Result<String>() != null);
    Contract.EndContractBlock();
    return TrimHelper(TrimBoth);
}
TrimHelper() looks like this:
[System.Security.SecuritySafeCritical] // auto-generated
private String TrimHelper(int trimType) {
    //end will point to the first non-trimmed character on the right
    //start will point to the first non-trimmed character on the Left
    int end = this.Length-1;
    int start=0;
    //Trim specified characters.
    if (trimType !=TrimTail) {
        for (start=0; start < this.Length; start++) {
            if (!Char.IsWhiteSpace(this[start]) && !IsBOMWhitespace(this[start])) break;
        }
    }
    if (trimType !=TrimHead) {
        for (end= Length -1; end >= start; end--) {
            if (!Char.IsWhiteSpace(this[end]) && !IsBOMWhitespace(this[start])) break;
        }
    }
    return CreateTrimmedString(start, end);
}
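To make the two loops above easier to follow, here is a minimal sketch of the same scan written against the public API only. The class name TrimSketch and the method TrimBoth are hypothetical, and the sketch ignores the IsBOMWhitespace check, so treat it as an illustration rather than the framework's implementation.

```csharp
using System;

// Hypothetical helper, not part of the framework.
public static class TrimSketch
{
    public static string TrimBoth(string s)
    {
        int start = 0;
        int end = s.Length - 1;

        // Advance start to the first non-whitespace character.
        while (start < s.Length && char.IsWhiteSpace(s[start])) start++;

        // Move end back to the last non-whitespace character.
        while (end >= start && char.IsWhiteSpace(s[end])) end--;

        // For an empty or all-whitespace input the length is 0.
        return s.Substring(start, end - start + 1);
    }

    public static void Main()
    {
        Console.WriteLine("[" + TrimBoth(" \nhello\r\t ") + "]");
    }
}
```

The only decision that matters for the question is which characters char.IsWhiteSpace accepts.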
So the answer to your question mostly comes down to the Char.IsWhiteSpace method in char.cs:
[Pure]
public static bool IsWhiteSpace(char c) {
    if (IsLatin1(c)) {
        return (IsWhiteSpaceLatin1(c));
    }
    return CharUnicodeInfo.IsWhiteSpace(c);
}
If it's a Latin-1 character (U+0000 to U+00FF), then this is what constitutes white space:
private static bool IsWhiteSpaceLatin1(char c) {
    // There are characters which belong to UnicodeCategory.Control but are considered as white spaces.
    // We use code point comparisons for these characters here as a temporary fix.
    // U+0009 = <control> HORIZONTAL TAB
    // U+000a = <control> LINE FEED
    // U+000b = <control> VERTICAL TAB
    // U+000c = <control> FORM FEED
    // U+000d = <control> CARRIAGE RETURN
    // U+0085 = <control> NEXT LINE
    // U+00a0 = NO-BREAK SPACE
    if ((c == ' ') || (c >= '\x0009' && c <= '\x000d') || c == '\x00a0' || c == '\x0085') {
        return (true);
    }
    return (false);
}
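As a quick check of the Latin-1 branch, the sketch below pads a string with the eight characters that the condition above accepts, written as C# escapes, and trims it. The class name Latin1TrimDemo and its members are made up for this example.

```csharp
using System;

// Hypothetical demo class, not part of the framework.
public static class Latin1TrimDemo
{
    // The eight Latin-1 characters accepted by IsWhiteSpaceLatin1:
    // space, U+0009 to U+000D, U+0085 and U+00A0.
    public static readonly char[] Latin1WhiteSpace =
    {
        '\u0020', '\t', '\n', '\v', '\f', '\r', '\u0085', '\u00A0'
    };

    public static string Run()
    {
        string padding = new string(Latin1WhiteSpace);
        return (padding + "hello" + padding).Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Run() == "hello"); // prints True
    }
}
```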
For characters outside Latin-1 we have to go to CharUnicodeInfo.cs, which uses the UnicodeCategory enum to check for whitespace characters:
internal static bool IsWhiteSpace(char c)
{
    UnicodeCategory uc = GetUnicodeCategory(c);
    // In Unicode 3.0, U+2028 is the only character which is under the category "LineSeparator".
    // And U+2029 is the only character which is under the category "ParagraphSeparator".
    switch (uc) {
        case (UnicodeCategory.SpaceSeparator):
        case (UnicodeCategory.LineSeparator):
        case (UnicodeCategory.ParagraphSeparator):
            return (true);
    }
    return (false);
}
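Since the non-Latin-1 set depends on the Unicode data shipped with the runtime, one way to get an exact list for your own environment is to ask char.IsWhiteSpace about every char value. The following is a small sketch; the class name TrimmableChars and its methods are invented for this example, and the output may differ between framework versions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the framework.
public static class TrimmableChars
{
    // Collects every UTF-16 code unit that char.IsWhiteSpace accepts.
    public static List<char> Find()
    {
        var result = new List<char>();
        // Loop over int so the counter can pass char.MaxValue without wrapping.
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (char.IsWhiteSpace(c)) result.Add(c);
        }
        return result;
    }

    // Formats a character as a C# escape such as \u000A.
    public static string Escape(char c)
    {
        return "\\u" + ((int)c).ToString("X4");
    }

    public static void Main()
    {
        foreach (char c in Find())
        {
            Console.WriteLine(Escape(c));
        }
    }
}
```

Printing \uXXXX escapes rather than the characters themselves keeps the list readable, because most of these characters are invisible.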