I have a program that reads certain strings from memory. For the most part the strings contain recognizable characters, but at random points "weird" characters appear that I did not recognize. By pasting them into a site that identifies Unicode characters, I found that a selection of the "weird" characters were these:
I wanted to parse my strings to remove these characters. Looking at the strings, though, I found that the unwanted characters were always enclosed between an SOT and an EOT character (Start of Text, U+0002, and End of Text, U+0003).
Therefore, I think my question is: how can I remove, from a string, all substrings that start with SOT and end with EOT?
Edit: Attempt at Solution
Using ideas from @RagingCain I made the following method:
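private static string RemoveInvalidCharacters(string input)
{
    while (true)
    {
        // Find the next Start of Text marker (U+0002).
        var start = input.IndexOf('\u0002');
        if (start == -1) break;
        // Find the End of Text marker (U+0003) that follows it.
        // Checking start first matters: IndexOf throws on a negative start index.
        var end = input.IndexOf('\u0003', start);
        if (end == -1) break;
        // Remove the whole span, including both markers (hence the + 1).
        input = input.Remove(start, end - start + 1);
    }
    return input;
}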
It does the trick, thanks again.
A regex should work fine here. Put these characters into the pattern, then either use Match to find them, use Replace to swap them for a space " ", or cut them from the string altogether by replacing them with the empty string "".
Regex.Replace: https://msdn.microsoft.com/en-us/library/xwewhkd1(v=vs.110).aspx
Regex.Match: https://msdn.microsoft.com/en-us/library/bk1x0726(v=vs.110).aspx
Regex example:
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string input = "This is   text with   far  too   much   " +
                       "whitespace.";
        // Matches one or more consecutive whitespace characters.
        string pattern = "\\s+";
        string replacement = " ";
        Regex rgx = new Regex(pattern);
        string result = rgx.Replace(input, replacement);
        Console.WriteLine("Original String: {0}", input);
        Console.WriteLine("Replacement String: {0}", result);
    }
}
I know the difficulty of not being able to "see" these characters, so define them as char variables by their Unicode code points and add those variables to the pattern you pass to Replace.
Char Variables: https://msdn.microsoft.com/en-us/library/x9h8tsay.aspx
Unicode for Start of Text: http://www.fileformat.info/info/unicode/char/0002/index.htm
Unicode for End of Text: http://www.fileformat.info/info/unicode/char/0003/index.htm
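Putting the pieces above together, here is a minimal sketch of what that could look like for the SOT/EOT case. The class name, the constant names, and the StripMarkedSpans helper are made up for illustration, and the sketch assumes the markers are exactly U+0002 and U+0003 and that everything between a pair should be dropped along with the markers themselves.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ControlSpanRegexDemo
{
    // U+0002 (Start of Text) and U+0003 (End of Text), written as escapes
    // so the invisible characters never have to appear in the source file.
    private const char Sot = '\u0002';
    private const char Eot = '\u0003';

    // Lazy match: SOT, then as few characters as possible, then EOT.
    // RegexOptions.Singleline lets '.' also match newlines inside a span.
    private static readonly Regex MarkedSpan =
        new Regex(Sot + ".*?" + Eot, RegexOptions.Singleline);

    // Hypothetical helper: removes every SOT...EOT span, markers included.
    // An SOT with no EOT after it does not match and is left in place.
    public static string StripMarkedSpans(string input)
    {
        return MarkedSpan.Replace(input, string.Empty);
    }

    public static void Main()
    {
        string sample = "Hello" + Sot + "junk" + Eot + ", world" + Sot + "x" + Eot + "!";
        Console.WriteLine(StripMarkedSpans(sample)); // prints "Hello, world!"
    }
}
```

The lazy quantifier (.*?) is the important part: a greedy .* would run from the first SOT to the last EOT and swallow any good text in between.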
To apply this to your problem: check whether the string contains SOT and EOT. If it does, remove the entire string, the sub-string between them, or just the SOT and EOT characters, whichever you need.
It may be easier to split the original string into a string[] and then go line by line. It's difficult to say how to parse your string without knowing what it looks like, so hopefully I provided something that helps ^.^
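If you would rather avoid regex, the same check can be sketched as a single pass over the characters. This is only one possible way to do it, not something from the linked docs: the class and helper names are hypothetical, and it assumes that an SOT with no closing EOT should hide the rest of the string (change that branch if you want different behavior).

```csharp
using System;
using System.Text;

public static class ControlSpanScanDemo
{
    private const char Sot = '\u0002'; // Start of Text
    private const char Eot = '\u0003'; // End of Text

    // Hypothetical helper: copies characters to the output unless they sit
    // between an SOT and the next EOT (both markers are dropped as well).
    // Assumption: an SOT with no closing EOT hides the rest of the string.
    // A stray EOT that appears outside any span is copied through unchanged.
    public static string StripMarkedSpans(string input)
    {
        var output = new StringBuilder(input.Length);
        bool insideSpan = false;

        foreach (char c in input)
        {
            if (insideSpan)
            {
                if (c == Eot) insideSpan = false; // span closed; EOT is dropped
            }
            else if (c == Sot)
            {
                insideSpan = true; // span opened; SOT is dropped
            }
            else
            {
                output.Append(c);
            }
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(StripMarkedSpans("ab\u0002cd\u0003ef")); // prints "abef"
    }
}
```

Because it builds the result in a StringBuilder, this version touches each character once instead of creating a new string for every span it removes.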