Replace Unicode Character "ï¿½" With A Space

Tags: c#, regex, validation, trim

I'm doing a bulk upload of data from a .csv file and I need to replace the non-ASCII character "ï¿½" with a normal space, " ".

The character "ï¿½" corresponds to "\uFFFD" in C, C++, and Java, and it appears to be called REPLACEMENT CHARACTER. There are other space-like characters, such as U+FEFF, U+205F, U+200B, U+180E, and U+202F, mentioned in the official C# documentation.

I'm trying to do the replacement this way:

public string Errors = "";

public void test(){

    string textFromCsvCell = "";
    string validCharacters = "^[0-9A-Za-z().:%-/ ]+$";
    textFromCsvCell = "This is my text from csv file"; // not all of these spaces are normal spaces " "
    string cleaned = textFromCsvCell.Replace("\uFFFD", " "); // intended: swap the odd character for a normal space
    if (Regex.IsMatch(cleaned, validCharacters))
    {
        // all code for insert
    }
    else
    {
        Errors = cleaned;
        // print Errors
    }
}

The test method shows me this text:

"This is myï¿½texto from csv file"

I have also tried several other approaches:

Attempt 1: Using Trim

 Regex.Replace(value.Trim(), @"[^\S\r\n]+", " ");

Attempt 2: Using Replace

  System.Text.RegularExpressions.Regex.Replace(str, @"\s+", " ");

Attempt 3: Using Trim

  String.Trim(new char[]{'\uFEFF', '\u200B'});

Attempt 4: Adding [\S\r\n] to validCharacters

  string validCharacters = "^[\S\r\n0-9A-Za-z().:%-/ ]+$";

None of these work.

How can I replace it?

Sources:

Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)
Trying to replace all white space with a single space
Strip the byte order mark from string in C#
Remove extra whitespaces, but keep new lines using a regular expression in C#

EDITED

This is the original string:

"SYSTEM OF MONITORING CONTINUES OF GLUCOSE"

In 0x... notation, the character between "OF" and "MONITORING" turns out to be 0xA0 (a non-breaking space):

SYSTEM OF0xA0MONITORING CONTINUES OF GLUCOSE

Solution

Paste the text into a Unicode code converter, look at the code points it reports, and replace the offending ones.
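As an alternative to an online converter, the code points can also be dumped directly in C#. The following is only a minimal sketch: the class name `CodePointDump` and the method `Describe` are made up for illustration, and the sample string is a shortened stand-in for the real CSV cell.

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Hypothetical helper: formats every UTF-16 code unit of the input
    // as U+XXXX so that invisible characters become visible.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // "\u00A0" stands in for the invisible character from the CSV cell.
        string sample = "OF\u00A0M";
        Console.WriteLine(Describe(sample)); // U+004F U+0046 U+00A0 U+004D
    }
}
```

Any value outside U+0000..U+007F in the output is a candidate for replacement.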

In my case, I do a simple replace:

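 // requires: using System.Text.RegularExpressions;
 string value = "SYSTEM OF\u00A0MONITORING CONTINUES OF GLUCOSE";
 // value contains a non-breaking space (U+00A0) between "OF" and "MONITORING";
 // it was displayed as "SYSTEM OFï¿½MONITORING CONTINUES OF GLUCOSE"
 string cleaned = "";
 string pattern = @"[^\u0000-\u007F]+"; // any run of non-ASCII characters
 string replacement = " ";

 Regex rgx = new Regex(pattern);
 cleaned = rgx.Replace(value, replacement);

 // note: inside the brackets, %-/ is a range (U+0025 to U+002F)
 if (Regex.IsMatch(cleaned, "^[0-9A-Za-z().:<>%-/ ]+$")) {
    // all code for insert
 } else {
    // error messages
 }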

This expression lists the whitespace characters: space, form feed, line feed, carriage return, tab, vertical tab, non-breaking space, and the other Unicode space and line separators:

[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]
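One way the character class above could be used in C# is sketched below. The names `WhitespaceNormalizer` and `Normalize` are assumptions for illustration only. Note that this version also turns line breaks into spaces, because `\n` and `\r` are part of the class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceNormalizer
{
    // The character class quoted above, written as a verbatim string so the
    // regex engine (not the C# compiler) interprets the escapes.
    private const string SpaceClass =
        @"[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]";

    // Hypothetical helper: replaces each listed whitespace character
    // with a single normal space (U+0020).
    public static string Normalize(string text)
    {
        return Regex.Replace(text, SpaceClass, " ");
    }

    public static void Main()
    {
        string value = "SYSTEM OF\u00A0MONITORING CONTINUES OF GLUCOSE";
        Console.WriteLine(Normalize(value)); // SYSTEM OF MONITORING CONTINUES OF GLUCOSE
    }
}
```

Unlike the non-ASCII pattern, this leaves accented letters and other non-space characters untouched.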

References

Regular expressions (MDN)

asked May 16 '17 13:05

Diego Ferb

2 Answers

Using String.Replace:

Use a simple String.Replace().

I've assumed that the only characters you want to remove are the ones mentioned in the question (ï¿½) and that you want to replace each of them with a normal space.

string text = "impï¿½ortant";
string cleaned = text.Replace('\u00ef', ' ')
        .Replace('\u00bf', ' ')
        .Replace('\u00bd', ' ');
// Returns 'imp   ortant'

Or using Regex.Replace:

string cleaned = Regex.Replace(text, "[\u00ef\u00bf\u00bd]", " ");
// Returns 'imp   ortant'

Try it out: Dotnet Fiddle

answered Oct 06 '22 23:10

degant

Define a range of ASCII characters, and replace anything that is not within that range.

We want to find only non-ASCII characters, so we match anything outside the ASCII range and replace it.

Regex.Replace("This is my te\uFFFDxt from csv file", @"[^\u0000-\u007F]+", " ")

The above pattern matches any run of characters that are not (^) in the set ([ ]) defined by the range \u0000-\u007F, that is, the ASCII characters (everything past \u007F is non-ASCII), and replaces each run with a single space.

Result

This is my te xt from csv file

You can adjust the range \u0000-\u007F to widen the set of allowed characters as needed.
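As a sketch of such an adjustment, the pattern below additionally keeps U+00C0 to U+00FF (mostly accented Latin letters). That extra range, the class name `NonAsciiCleaner`, and the sample text are assumptions for illustration and are not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonAsciiCleaner
{
    // Assumed allowed set: ASCII plus U+00C0..U+00FF.
    // Everything else is treated as unwanted.
    private const string Pattern = @"[^\u0000-\u007F\u00C0-\u00FF]+";

    // Hypothetical helper: replaces each run of disallowed characters
    // with a single normal space.
    public static string Clean(string text)
    {
        return Regex.Replace(text, Pattern, " ");
    }

    public static void Main()
    {
        // U+00E9 (é) is kept; U+00A0 and U+FFFD are replaced by spaces.
        string cleaned = Clean("caf\u00E9\u00A0con\uFFFDleche");
        Console.WriteLine(cleaned == "caf\u00E9 con leche"); // True
    }
}
```

With the original ASCII-only range, the é in this sample would have been replaced by a space as well.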

answered Oct 06 '22 23:10

ΩmegaMan
