
Detect special symbols in C#

I'm working on a C# project in which some data contains characters that are not recognised by the encoding. They are displayed like this:

"Some text � with special � symbols in it".

I have no control over the encoding process, and the data comes from files of various origins and formats. I want to be able to flag data that contains such characters as erroneous or incomplete. Right now I am able to detect them this way:

if(myString.Contains("�"))
{
   //Do stuff
}

While it does work, it doesn't feel quite right to use the weird symbol directly in the Contains function. Isn't there a cleaner way to do this?

EDIT:

After checking back with the team responsible for reading the files, this is how they do it:

var sr = new StreamReader(filePath, true);
var content = sr.ReadToEnd();

Passing true as a second parameter of StreamReader is supposed to detect the encoding from the file's BOM, and use it to read the content. It doesn't always work though, as some files don't bear that information, hence why their data is read incorrectly.

We ran some tests, and using StreamReader(filePath, Encoding.Default) instead appears to work for most if not all of the files we had issues with. As expected, files that were working before no longer work, because they do not use the default encoding.

So the best solution for us would be to do the following: read the file trying to detect its encoding, then if it wasn't successful read it again with the default encoding.

The problem remains the same though: how do we check, after trying to detect the file's encoding, whether the data has been read incorrectly?

Hal asked Feb 09 '17 16:02


1 Answer

The � character is not a special symbol. It's the Unicode Replacement Character (U+FFFD). It means the code tried to decode the file's bytes using the wrong encoding, typically by reading text saved in an ANSI codepage as UTF8. Any byte sequences that weren't valid in the encoding used for reading were replaced with �.
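As a minimal sketch of a cleaner check, the character can be referenced by its code point instead of pasting the glyph into the source. The class and method names below are hypothetical and chosen only for illustration.

```csharp
using System;

public static class ReplacementCharCheck
{
    // U+FFFD, the Unicode Replacement Character.
    private const char ReplacementChar = '\uFFFD';

    // Returns true when the text contains at least one U+FFFD,
    // which usually means some bytes could not be decoded.
    public static bool ContainsReplacementChar(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.IndexOf(ReplacementChar) >= 0;
    }
}
```

Keep in mind that a file can legitimately contain U+FFFD (for example if it was already damaged before it was saved), so a positive result is a strong hint of a decoding problem rather than proof.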

The solution is to read the file using the correct encoding. The default encoding used by the File methods or StreamReader is UTF8. You can pass a different encoding using the appropriate constructor, e.g. StreamReader(String, Encoding). To use the system locale's codepage, you need to use Encoding.Default (this holds on .NET Framework; on .NET Core and later, Encoding.Default is always UTF8):

var sr = new StreamReader(filePath, Encoding.Default);

You can use the StreamReader(String, Encoding, Boolean) constructor to autodetect Unicode encodings from the BOM and fall back to a different encoding when no BOM is present.

Assuming the files are either some type of Unicode or match your system locale, you can use:

var sr = new StreamReader(filePath, Encoding.Default, true);
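If that assumption does not hold, for instance when some files are UTF8 without a BOM, the two-step approach described in the question may fit better. The following is only a sketch under that assumption: it reads the file as UTF8 with BOM detection, and re-reads it with a caller-supplied fallback encoding if the result contains U+FFFD. FallbackFileReader and ReadWithFallback are hypothetical names, not part of the framework.

```csharp
using System;
using System.IO;
using System.Text;

public static class FallbackFileReader
{
    // Hypothetical helper: the first pass honors a BOM and otherwise assumes UTF-8;
    // the second pass runs only if the first one produced U+FFFD.
    public static string ReadWithFallback(string path, Encoding fallback)
    {
        string text;
        using (var reader = new StreamReader(path, Encoding.UTF8, true))
        {
            text = reader.ReadToEnd();
        }

        // No replacement character: treat the first pass as successful.
        if (text.IndexOf('\uFFFD') < 0)
        {
            return text;
        }

        // Re-read the file with the fallback encoding (a BOM still takes priority).
        using (var reader = new StreamReader(path, fallback, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call could look like FallbackFileReader.ReadWithFallback(filePath, Encoding.Default). The file is read twice only when the first pass fails, and the second pass cannot confirm that the fallback encoding is the right one; it only avoids the invalid-byte replacements.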

StreamReader's source shows that the DetectEncoding method checks the first bytes of a file to determine the encoding. If a BOM is found, the matching encoding is used instead of the supplied one. The operation doesn't cause extra IO because the method checks the class's internal buffer.
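To see which encoding the reader ended up using, the CurrentEncoding property can be inspected after the first read, since detection is not performed until then. This is a small sketch; EncodingProbe and DetectedEncoding are hypothetical names used for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingProbe
{
    // Hypothetical helper: returns the encoding StreamReader settles on for the stream,
    // i.e. the one indicated by a BOM, or the supplied fallback when no BOM is found.
    public static Encoding DetectedEncoding(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, true))
        {
            // BOM detection happens on the first read, so read one character first.
            reader.Read();
            return reader.CurrentEncoding;
        }
    }
}
```

Note that this only reports what the BOM indicates. A file without a BOM always reports the fallback encoding, whether or not that encoding is actually correct for its content.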

Panagiotis Kanavos answered Sep 17 '22 17:09