How do I read and write smart quotes (and other silly characters) in C#?

I'm writing a program that reads all the text in a file into a string, loops over that string looking at the characters, and then appends the characters to another string using a StringBuilder. The issue I'm having is that when it's written back out, special characters such as “ and ” come out looking like ï¿½ instead. I don't need to do a conversion; I just want the text written back out the way I read it in:

    StringBuilder sb = new StringBuilder();
    string text = File.ReadAllText(filePath);
    for (int i = 0; i < text.Length; ++i) {
        if (text[i] != '{') {  // looking for opening curly brace
            sb.Append(text[i]);
            continue;
        }
        // Do stuff
    }
    File.WriteAllText(destinationFile, sb.ToString());

I tried using different Encodings (UTF-8, UTF-16, ASCII), but then it just came out even worse; I started getting question mark symbols and Chinese characters (yes, a bit of a shotgun approach, but I was just experimenting). I did read this article: http://www.joelonsoftware.com/articles/Unicode.html ...but it didn't really explain why I was seeing what I saw, unless in C#, the reader starts cutting off bits when it hits weird characters like that. Thanks in advance for any help!

BrDaHa asked Nov 30 '12 02:11

2 Answers

TL;DR: that is definitely not UTF-8, and you are not even using UTF-8 to read the resulting file. Read as Windows-1252 and write as Windows-1252 (if you are going to use the same viewing method to view the resulting file).


Well, let's first just say that there is no way a file made by a regular user will be in UTF-8. Not all programs on Windows even support it (Excel, Notepad...), let alone have it as the default encoding (even most developer tools don't default to UTF-8, which drives me insane). Since a lot of developers don't understand that such a thing as encoding even exists, what chance do regular users have of saving their files as UTF-8 in such a UTF-8-hostile environment?

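This is where your problems first start. According to the documentation, the overload you are using, File.ReadAllText(filePath), can only detect UTF-8 or UTF-32, and it does so by looking for a byte order mark. When there is no byte order mark, the bytes are decoded as UTF-8.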

Indeed, simply reading a file encoded normally in Windows-1252 that contains "a”a" results in the string "a�a", where � is the Unicode replacement character (read the Wikipedia section; it describes exactly the situation you are in!), used to replace invalid bytes. When the replacement character is then encoded as UTF-8 and interpreted as Windows-1252, you will see ï¿½, because the bytes for � in UTF-8 are 0xEF, 0xBF, 0xBD, which are the bytes for ï¿½ in Windows-1252.
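
The three steps described above can be reproduced in memory without touching any files. This is only a sketch: the class and method names are made up for illustration, and the `Encoding.RegisterProvider` call assumes .NET Core 3.0 or later (or .NET 5+), where code-page encodings have to be registered first. On the .NET Framework that line should not be needed.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Returns the text a Windows-1252 viewer would show after the bytes
    // have gone through a UTF-8 read followed by a UTF-8 write.
    public static string RoundTrip(byte[] windows1252Bytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // is only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // Step 1: decode as UTF-8, as File.ReadAllText(path) does when the
        // file has no byte order mark. Invalid bytes become U+FFFD.
        string decoded = Encoding.UTF8.GetString(windows1252Bytes);

        // Step 2: encode as UTF-8, as File.WriteAllText(path, text) does.
        // U+FFFD becomes the three bytes EF BF BD.
        byte[] rewritten = Encoding.UTF8.GetBytes(decoded);

        // Step 3: the viewer interprets those bytes as Windows-1252.
        return windows1252.GetString(rewritten);
    }

    public static void Main()
    {
        // "a”a" in Windows-1252: the right double quote is the single byte 0x94.
        byte[] original = { 0x61, 0x94, 0x61 };
        Console.WriteLine(RoundTrip(original)); // the returned string is "aï¿½a"
    }
}
```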

So read it as Windows-1252 and you're half-way there:

    Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
    String result = File.ReadAllText(@"C:\myfile.txt", windows1252);
    Console.WriteLine(result); // Correctly prints "a”a" now

Because you saw ï¿½, the tool you are viewing the newly made file with is also using Windows-1252. So if the goal is to have the file show the correct characters in that tool, you must encode the output as Windows-1252:

    Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
    File.WriteAllText(@"C:\myFile", sb.ToString(), windows1252);
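
Putting the read and the write together, a minimal end-to-end sketch might look like the following. The helper name `CopyAsWindows1252` and the temporary files are hypothetical, the loop body stands in for the question's per-character processing, and the provider registration again assumes .NET Core 3.0 or later.

```csharp
using System;
using System.IO;
using System.Text;

public static class RoundTripDemo
{
    // Reads a file as Windows-1252, passes every character through a
    // StringBuilder, and writes the result back out as Windows-1252.
    public static void CopyAsWindows1252(string sourcePath, string destinationPath)
    {
        // Assumption: .NET Core / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding("Windows-1252");

        string text = File.ReadAllText(sourcePath, windows1252);
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
        {
            sb.Append(c); // placeholder for the real per-character logic
        }
        File.WriteAllText(destinationPath, sb.ToString(), windows1252);
    }

    public static void Main()
    {
        string source = Path.GetTempFileName();
        string destination = Path.GetTempFileName();

        // "a”a" as Windows-1252 bytes.
        File.WriteAllBytes(source, new byte[] { 0x61, 0x94, 0x61 });
        CopyAsWindows1252(source, destination);

        // The output bytes match the input bytes.
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(destination))); // 61-94-61
    }
}
```

Comparing the raw bytes rather than the displayed text keeps the check independent of whatever encoding the console or editor happens to use.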

Esailija answered Nov 03 '22 00:11


Chances are the text will be UTF-8.

    File.ReadAllText(filePath, Encoding.UTF8)

coupled with

    File.WriteAllText(destinationFile, sb.ToString(), Encoding.UTF8)

should cover dealing with the Unicode characters. If you do only one or the other, you're going to get garbage output; it's both or nothing.
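
If it is unclear whether a given file really is UTF-8, one way to check is to decode it with a strict UTF-8 decoder that throws on invalid bytes, and fall back to another encoding when it does. The sketch below is not part of either answer: the `EncodingGuess` class is hypothetical, and the fallback to Windows-1252 is an assumption about where the file came from. It also does not strip a byte order mark, and the provider registration assumes .NET Core 3.0 or later.

```csharp
using System;
using System.Text;

public static class EncodingGuess
{
    // Decodes bytes as UTF-8 if they are valid UTF-8,
    // otherwise as Windows-1252 (an assumed fallback).
    public static string Decode(byte[] bytes)
    {
        // Second argument (throwOnInvalidBytes: true) makes invalid
        // sequences throw instead of silently turning into U+FFFD.
        UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Assumption: .NET Core / .NET 5+; not needed on .NET Framework.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }

    public static void Main()
    {
        byte[] asWindows1252 = { 0x61, 0x94, 0x61 };         // "a”a" in Windows-1252
        byte[] asUtf8 = { 0x61, 0xE2, 0x80, 0x9D, 0x61 };    // "a”a" in UTF-8

        // Both inputs decode to the same three-character string.
        Console.WriteLine(Decode(asWindows1252) == Decode(asUtf8)); // True
    }
}
```

Whichever encoding the read ends up using should also be passed to File.WriteAllText if the output is meant to match the input.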

Steve Py answered Nov 02 '22 23:11