
Replace Unicode Character "�" With A Space

I'm doing a bulk upload of data from a .csv file, and I need to replace the non-ASCII character "�" with a normal space, " ".

The character "�" corresponds to "\uFFFD" in C, C++, and Java, and is apparently called REPLACEMENT CHARACTER. The official C# documentation lists other space-like characters as well, such as U+FEFF, U+205F, U+200B, U+180E, and U+202F.

I'm trying to do the replacement this way:

public string Errors = "";

public void test()
{
    string validCharacters = "^[0-9A-Za-z().:%-/ ]+$";
    string textFromCsvCell = "This is my text from csv file"; // The spaces are not all normal spaces " "
    string cleaned = textFromCsvCell.Replace("\uFFFD", " ");
    if (Regex.IsMatch(cleaned, validCharacters))
    {
        // All code for insert
    }
    else
    {
        Errors = cleaned;
        // print Errors
    }
}

The test method shows me this text:

"This is my�text from csv file"

I also tried several other solutions:

Attempt 1: Using Trim and Regex.Replace

 Regex.Replace(value.Trim(), @"[^\S\r\n]+", " ");

Attempt 2: Using Regex.Replace

  System.Text.RegularExpressions.Regex.Replace(str, @"\s+", " ");

Attempt 3: Using Trim

  str.Trim(new char[]{'\uFEFF', '\u200B'});

Attempt 4: Adding [\S\r\n] to validCharacters

  string validCharacters = @"^[\S\r\n0-9A-Za-z().:%-/ ]+$";

Nothing works.

How can I replace it?

Sources:

  • Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)

  • Trying to replace all white space with a single space

  • Strip the byte order mark from string in C#

  • Remove extra whitespaces, but keep new lines using a regular expression in C#

EDITED

This is the original string:

"SYSTEM OF MONITORING CONTINUES OF GLUCOSE"

The same string in 0x... notation:

SYSTEM OF0xA0MONITORING CONTINUES OF GLUCOSE
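
One possible explanation for seeing "�" is an assumption here, not something the question confirms: the file is stored in a single-byte encoding such as ISO-8859-1 but is decoded as UTF-8. A lone 0xA0 byte is not valid UTF-8, so the decoder substitutes U+FFFD, whereas ISO-8859-1 maps it to the no-break space U+00A0. The sketch below illustrates that difference; the names DecodingDemo, Sample and ThirdChar are made up for the example.

```csharp
using System;
using System.Text;

public static class DecodingDemo
{
    // Bytes for "OF", then 0xA0, then "MO": the sequence shown in the hex dump above.
    public static readonly byte[] Sample = { 0x4F, 0x46, 0xA0, 0x4D, 0x4F };

    // Decodes the sample with the given encoding and returns the character at index 2.
    public static char ThirdChar(Encoding encoding)
    {
        return encoding.GetString(Sample)[2];
    }

    public static void Main()
    {
        // 0xA0 on its own is not valid UTF-8, so the decoder substitutes U+FFFD.
        Console.WriteLine("UTF-8:   U+{0:X4}", (int)ThirdChar(Encoding.UTF8));
        // In ISO-8859-1 the byte 0xA0 is the no-break space U+00A0.
        Console.WriteLine("Latin-1: U+{0:X4}", (int)ThirdChar(Encoding.GetEncoding("iso-8859-1")));
    }
}
```

If that assumption holds for your file, opening the reader with the matching encoding avoids producing U+FFFD in the first place, and only the U+00A0 characters remain to be replaced.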

Solution

Paste the text into a Unicode code converter, look at the code points it reports for the problem characters, and replace exactly those.
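
The same inspection can also be done in code instead of an online converter. This is a minimal sketch, not part of the original solution; CodePointDump and NonAscii are hypothetical names. It lists the position and code of every character outside the ASCII range.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDump
{
    // Returns "index: U+XXXX" for every UTF-16 code unit outside the ASCII range.
    public static List<string> NonAscii(string text)
    {
        var found = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] > '\u007F')
            {
                found.Add(string.Format("{0}: U+{1:X4}", i, (int)text[i]));
            }
        }
        return found;
    }

    public static void Main()
    {
        // The no-break space is written as an escape so it is visible in the source.
        foreach (string line in NonAscii("SYSTEM OF\u00A0MONITORING"))
        {
            Console.WriteLine(line); // prints "9: U+00A0"
        }
    }
}
```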

In my case, a simple replace was enough:

 // value contains a no-break space (U+00A0) between "OF" and "MONITORING"
 string value = "SYSTEM OF\u00A0MONITORING CONTINUES OF GLUCOSE";
 string pattern = @"[^\u0000-\u007F]+"; // any run of non-ASCII characters
 string replacement = " ";

 Regex rgx = new Regex(pattern);
 string cleaned = rgx.Replace(value, replacement);

 if (Regex.IsMatch(cleaned, "^[0-9A-Za-z().:<>%-/ ]+$"))
 {
    // all code for insert
 }
 else
 {
    // error messages
 }

This character class lists the whitespace characters: space, form feed, line feed, carriage return, tab, vertical tab, and the Unicode space characters, including the no-break space (U+00A0):

[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]
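
As a rough sketch of how the character class above could be used from C#, the helper below replaces every character in the class with an ordinary space. SpaceNormalizer and Normalize are hypothetical names, and the run \u2000 to \u200a is written as a range for brevity. Note that it also turns line breaks and tabs into spaces, which may or may not be what you want for CSV cells.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceNormalizer
{
    // The character class from the text, written as a verbatim string so the
    // regex engine (not the C# compiler) interprets the escapes.
    private static readonly Regex AnySpace = new Regex(
        @"[ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]");

    // Replaces each matched character with one ordinary space (U+0020).
    public static string Normalize(string text)
    {
        return AnySpace.Replace(text, " ");
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("SYSTEM OF\u00A0MONITORING\u2003CONTINUES"));
        // prints "SYSTEM OF MONITORING CONTINUES"
    }
}
```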

References

  • Regular expressions (MDN)
Diego Ferb, asked May 16 '17 13:05



2 Answers

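Using String.Replace:

Use a simple String.Replace().

I've assumed that the only characters you want to remove are the ones mentioned in the question, showing up in your string as the three characters "ï¿½" (U+00EF, U+00BF, U+00BD, which is how the UTF-8 bytes of U+FFFD look when decoded as Latin-1), and that you want to replace each of them with a normal space.

string text = "imp\u00ef\u00bf\u00bdortant"; // "impï¿½ortant"
string cleaned = text.Replace('\u00ef', ' ')
        .Replace('\u00bf', ' ')
        .Replace('\u00bd', ' ');
// Returns 'imp   ortant'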

Or using Regex.Replace:

string cleaned = Regex.Replace(text, "[\u00ef\u00bf\u00bd]", " ");
// Returns 'imp   ortant'

Try it out: Dotnet Fiddle

degant, answered Oct 06 '22 23:10


Define a range of ASCII characters, and replace anything that is not within that range.


We want to find only the non-ASCII characters, so we match on anything outside the ASCII range and replace it.

Regex.Replace("This is my te\uFFFDxt from csv file", @"[^\u0000-\u007F]+", " ");

The pattern matches any run of characters that are not (^) in the set [ ] covering the range \u0000-\u007F (the ASCII characters; everything past \u007F is non-ASCII) and replaces the run with a single space.

Result

This is my te xt from csv file

You can adjust the range provided \u0000-\u007F as needed to expand the range of allowed characters to suit your needs.
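
As one illustration of adjusting the range, the sketch below additionally allows U+00C0 to U+00FF (the accented Latin-1 letters, plus × and ÷), so text such as "café" survives while U+00A0 and U+FFFD are still replaced. The extra range is only an example choice, and RangeCleaner and Clean are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeCleaner
{
    // Allows ASCII plus U+00C0-U+00FF. U+00A0 (no-break space) and U+FFFD
    // stay outside the allowed set, so runs of them become a single space.
    private static readonly Regex NotAllowed = new Regex(@"[^\u0000-\u007F\u00C0-\u00FF]+");

    public static string Clean(string text)
    {
        return NotAllowed.Replace(text, " ");
    }

    public static void Main()
    {
        string cleaned = Clean("caf\u00E9\u00A0au\uFFFDlait");
        // cleaned == "caf\u00E9 au lait": the accented letter is kept,
        // the no-break space and the replacement character become spaces.
        Console.WriteLine(cleaned);
    }
}
```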

ΩmegaMan, answered Oct 06 '22 23:10