
Removing hidden characters from within strings

My problem:

I have a .NET application that sends out newsletters via email. When a newsletter is viewed in Outlook, Outlook displays a question mark in place of a hidden character it can't recognize. These hidden characters come from end users who copy and paste the HTML that makes up the newsletter into a form and submit it. A C# Trim() removes them if they occur at the beginning or end of the string. Gmail does a good job of ignoring them. When I paste these characters into a Word document and turn on the “show paragraph marks and hidden symbols” option, they appear as one rectangle inside a bigger rectangle. The newsletter text can be in any language, so accepting Unicode characters is a must. I've tried looping through the string to detect the character, but the loop doesn't recognize it and passes over it. Asking end users to paste the HTML into Notepad before submitting it is out of the question.

My question:
How can I detect and eliminate these hidden characters using C#?

bradley4 asked Mar 06 '13 22:03



2 Answers

You can remove all control characters from your input string with something like this:

// Requires: using System.Linq;
string input; // this is your input string (assign it before use)
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

Here is the documentation for the IsControl() method.
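One caveat worth checking: char.IsControl() only matches the Unicode Control category (Cc, i.e. U+0000 to U+001F and U+007F to U+009F). The question doesn't say which character is involved, but invisible characters that commonly arrive via copy and paste, such as U+200B (zero width space) or U+FEFF (byte order mark), belong to the Format category (Cf) and would pass through. The following is a minimal sketch, not part of the original answer, that filters both categories; the HiddenCharFilter class and StripInvisible method are hypothetical names, and keeping tab, line feed and carriage return is an assumption made so the HTML layout survives.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class HiddenCharFilter
{
    // Hypothetical helper: drops Control (Cc) and Format (Cf) characters,
    // but keeps tab, line feed and carriage return.
    public static string StripInvisible(string input)
    {
        return new string(input.Where(c => !IsInvisible(c)).ToArray());
    }

    private static bool IsInvisible(char c)
    {
        // Assumption: ordinary whitespace control characters should survive.
        if (c == '\t' || c == '\n' || c == '\r') return false;

        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
        return category == UnicodeCategory.Control   // Cc
            || category == UnicodeCategory.Format;   // Cf, e.g. U+200B, U+FEFF
    }
}
```

Letters from any script are left alone, so multilingual text is preserved. Be aware that the Format category also contains U+200C and U+200D (zero width non-joiner and joiner), which carry meaning in some scripts and in emoji sequences; if the newsletters use those, exclude them from the filter.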

Or, if you want to keep letters and digits only, you can use the IsLetter and IsDigit methods instead (note that this also strips whitespace, punctuation and HTML markup):

string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray()); 
Yannick Blondeau answered Sep 21 '22 17:09


I usually use this regular expression to replace all non-printable characters.

By the way, most people consider tab, line feed and carriage return to be non-printable characters, but I don't, so the expression keeps them.

So here is the expression:

string output = Regex.Replace(input, @"[^\u0009\u000A\u000D\u0020-\u007E]", "*"); 
  • ^ at the start of the character class negates it: the pattern matches any character that is NOT one of the following
  • \u0009 is tab
  • \u000A is line feed
  • \u000D is carriage return
  • \u0020-\u007E means everything from space to ~, that is, every printable ASCII character

See an ASCII table if you want to change the ranges. Keep in mind that this expression also replaces every non-ASCII character, including accented and non-Latin letters.
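Since the question requires Unicode text to survive, a variant that targets Unicode categories rather than an ASCII range may be a better fit. The sketch below is an assumption on my part, not the answerer's expression: it removes characters in the Control (Cc) and Format (Cf) categories and uses .NET's character class subtraction to keep tab, line feed and carriage return. The UnicodeCleaner class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnicodeCleaner
{
    // \p{Cc} = control characters, \p{Cf} = format characters (U+200B, U+FEFF, ...).
    // The "-[\t\n\r]" part subtracts tab, line feed and carriage return from the class.
    // \p{C} is deliberately avoided: it also covers surrogates (Cs), and removing
    // those would break characters outside the Basic Multilingual Plane.
    private static readonly Regex Invisible =
        new Regex(@"[\p{Cc}\p{Cf}-[\t\n\r]]", RegexOptions.Compiled);

    public static string Clean(string input)
    {
        return Invisible.Replace(input, string.Empty);
    }
}
```

Unlike the ASCII-only expression, this leaves letters, digits and punctuation from every script untouched.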

To test the expression above, you can build a string yourself like this:

    string input = string.Empty;
    for (int i = 0; i < 255; i++)
    {
        input += (char)(i);
    }
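For the "detect" half of the question, it can help to find out exactly which character is hiding in a pasted string before deciding what to strip. The following is a rough sketch under my own assumptions: the HiddenCharReport class, the Describe method and the list of "suspicious" categories are all hypothetical choices, not something taken from either answer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class HiddenCharReport
{
    // Assumed list of categories that usually render as nothing; adjust as needed.
    private static readonly HashSet<UnicodeCategory> Suspicious = new HashSet<UnicodeCategory>
    {
        UnicodeCategory.Control,
        UnicodeCategory.Format,
        UnicodeCategory.PrivateUse,
        UnicodeCategory.OtherNotAssigned,
        UnicodeCategory.LineSeparator,
        UnicodeCategory.ParagraphSeparator
    };

    // Returns one line per suspicious character: its index, code point and category.
    public static List<string> Describe(string input)
    {
        var findings = new List<string>();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '\t' || c == '\n' || c == '\r') continue; // ordinary whitespace

            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            if (Suspicious.Contains(category))
            {
                findings.Add(string.Format("index {0}: U+{1:X4} ({2})", i, (int)c, category));
            }
        }
        return findings;
    }
}
```

For example, Describe("a\u200Bb") returns a single entry, "index 1: U+200B (Format)". Logging this for a few problem newsletters should show whether the offending character is a control character, a format character, or something else entirely.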
Mubashar answered Sep 19 '22 17:09