Remove substring that starts with SOT and ends EOT, from string

Question

I have a program that reads certain strings from memory. For the most part the strings contain recognizable characters, but at random points "weird" characters appear that I did not recognize. By pasting them into a site that identifies Unicode characters, I found that some of the "weird" characters were these:

\x{1} SOH, "start of heading", ctrl-a
\x{2} STX, "start of text", ctrl-b (called SOT in this question)
\x{3} ETX, "end of text", ctrl-c (called EOT in this question)
\x{7} BEL, bell, ctrl-g
\x{13} DC3, device control three, ctrl-s
\x{11} DC1, device control one, ctrl-q
\x{14} DC4, device control four, ctrl-t
\x{1A} SUB, substitute, ctrl-z
\x{6} ACK, acknowledge, ctrl-f

I wanted to parse my strings to remove these characters. Looking at the strings, though, I found that the unwanted characters always sit between an SOT and the EOT that follows it.

So my question is: how can I remove, from a string, all occurrences of substrings that start with SOT and end with EOT?

Edit: Attempt at Solution

Using ideas from @RagingCain I made the following method:

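    private static string RemoveInvalidCharacters(string input)
    {
        while (true)
        {
            var start = input.IndexOf('\u0002');
            // Check before the second search: IndexOf(char, int) throws
            // ArgumentOutOfRangeException when the start index is -1.
            if (start == -1) break;

            var end = input.IndexOf('\u0003', start);
            if (end == -1) break;

            Console.WriteLine(@"Start: " + start + @". End: " + end);
            // +1 so the closing character at 'end' is removed as well.
            var diff = end - start + 1;
            input = input.Remove(start, diff);
        }
        return input;
    }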

It does the trick, thanks again.
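The loop above builds a new string for every span it removes, because `string.Remove` returns a copy. As a rough alternative, here is a single-pass sketch with a `StringBuilder`. The class and method names (`ControlSpanFilter`, `Strip`) are made up for this example, and it assumes spans never nest. Note one behavioral difference: if a start character is never closed, this version drops the rest of the string, whereas the loop above leaves it in place.

```csharp
using System.Text;

// Hypothetical helper, not part of the original post.
public static class ControlSpanFilter
{
    private const char Stx = '\u0002'; // "start of text", called SOT in the question
    private const char Etx = '\u0003'; // "end of text", called EOT in the question

    // Copies every character that is not inside a start...end span.
    public static string Strip(string input)
    {
        var result = new StringBuilder(input.Length);
        var inside = false;

        foreach (var c in input)
        {
            if (!inside && c == Stx)
            {
                inside = true;      // span opens: stop copying
            }
            else if (inside && c == Etx)
            {
                inside = false;     // span closes: resume copying after it
            }
            else if (!inside)
            {
                result.Append(c);   // ordinary text (a stray end marker is kept)
            }
        }

        return result.ToString();
    }
}
```

Calling `ControlSpanFilter.Strip(input)` would replace the call to `RemoveInvalidCharacters(input)`.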

HouseCat · Accepted Answer

A regex should work fine here. Put these characters into the pattern, then either use Match to find them, Replace to swap them for a space " ", or cut them from the string altogether by replacing them with "".

Regex.Replace: https://msdn.microsoft.com/en-us/library/xwewhkd1(v=vs.110).aspx

Regex.Match: https://msdn.microsoft.com/en-us/library/bk1x0726(v=vs.110).aspx

Regex example:

    using System;
    using System.Text.RegularExpressions;

    public class Example
    {
        public static void Main()
        {
            string input = "This is   text with   far  too   much   " +
                           "whitespace.";
            // Verbatim string: "\s+" without the @ is an invalid C# escape sequence.
            string pattern = @"\s+";
            string replacement = " ";
            Regex rgx = new Regex(pattern);
            string result = rgx.Replace(input, replacement);

            Console.WriteLine("Original String: {0}", input);
            Console.WriteLine("Replacement String: {0}", result);
        }
    }

I know the difficulty of not being able to "see" these characters, though, so assign them to char variables by their Unicode code points and build the replace pattern from those variables.

Char Variables: https://msdn.microsoft.com/en-us/library/x9h8tsay.aspx

Unicode for Start of Text: http://www.fileformat.info/info/unicode/char/0002/index.htm

Unicode for End of Text: http://www.fileformat.info/info/unicode/char/0003/index.htm
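To make that concrete, here is a minimal sketch of a pattern built from two char constants. The names `ControlSpanRegex`, `Strip`, `Stx` and `Etx` are invented for this example, and it assumes spans never nest. The negated character class `[^ETX]*` matches everything up to the first end marker, including line breaks, so each span is removed separately. A start marker with no matching end marker is left untouched.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class ControlSpanRegex
{
    private const char Stx = '\u0002'; // start of text (SOT in the question)
    private const char Etx = '\u0003'; // end of text (EOT in the question)

    // Pattern: start marker, any run of characters that are not the end marker, end marker.
    // The control characters have no special meaning in a regex, so they match literally.
    private static readonly Regex Span =
        new Regex(Stx + "[^" + Etx + "]*" + Etx);

    // Replaces every start...end span with the empty string.
    public static string Strip(string input)
    {
        return Span.Replace(input, string.Empty);
    }
}
```

Passing `" "` instead of `string.Empty` would leave a space where each span used to be.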

To apply this to your problem: check whether the string contains SOT and EOT. If it does, remove the whole string, the substring between them, or just the SOT and EOT characters, whichever you need.

It may be easier to split the original string into a string[] and then go through it line by line. It's difficult to say how to parse your string without knowing what it looks like, so hopefully I provided something that helps ^.^
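One way the line-by-line idea might look is sketched below. It assumes the input uses "\r\n" or "\n" line endings and that no span crosses a line break; a span that does would not be removed by this version. The names `LineByLineCleaner` and `CleanLines` are made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class LineByLineCleaner
{
    // Splits the input into lines and strips start...end spans from each line.
    public static string[] CleanLines(string input)
    {
        return input
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
            // \u0002 and \u0003 are interpreted by the regex engine here.
            .Select(line => Regex.Replace(line, @"\u0002[^\u0003]*\u0003", string.Empty))
            .ToArray();
    }
}
```

`string.Join(Environment.NewLine, LineByLineCleaner.CleanLines(input))` would put the cleaned lines back together.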

Tags: string, c#, regex

Anders
