
Encoding UTF8 C# Process

I have an application that runs a VBScript file and captures its output.

private static string processVB(string command, string arguments)
{
    Process Proc = new Process();
    Proc.StartInfo.UseShellExecute = false;
    Proc.StartInfo.RedirectStandardOutput = true;
    Proc.StartInfo.RedirectStandardError = true;
    Proc.StartInfo.RedirectStandardInput = true;
    Proc.StartInfo.StandardOutputEncoding = Encoding.UTF8;
    Proc.StartInfo.StandardErrorEncoding = Encoding.UTF8;
    Proc.StartInfo.FileName = command;
    Proc.StartInfo.Arguments = arguments;
    Proc.StartInfo.WindowStyle = ProcessWindowStyle.Hidden; // prevent console window from popping up
    Proc.Start();
    string output = Proc.StandardOutput.ReadToEnd();
    string error = Proc.StandardError.ReadToEnd();

    if (String.IsNullOrEmpty(output) && !String.IsNullOrEmpty(error))
    {
        output = error;
    }
    //Console.Write(ping_output);

    Proc.WaitForExit();
    Proc.Close();

    return output;
}

I think I have set every encoding-related property correctly. The processVB method receives the VBScript file as command, plus its arguments.

The C# method processVB, which runs that VBScript file, currently produces the following output:

"����?"

But I should get original text

"äåéö€"

I believe I have set the encoding correctly, but I still cannot get the right output.

What am I doing wrong?

BinaryMee asked Mar 13 '14 13:03



1 Answer

This does not answer the question directly, but I noticed a potential deadlock in your code and thought it was worth posting anyway.

The deadlock potential exists because your code performs synchronous reads on redirected output, and does so for both StdOut and StdErr, in this section of the code:

Proc.Start();
string output = Proc.StandardOutput.ReadToEnd();
string error = Proc.StandardError.ReadToEnd();

...

Proc.WaitForExit();

What can happen is that the child process writes a lot of data to StdErr and fills up the pipe buffer. Once the buffer is full, the child blocks on its write to StdErr, without having signaled the end of the StdOut stream yet. So the child is blocked waiting for someone to read StdErr, and your process is blocked in StandardOutput.ReadToEnd() waiting for StdOut to end. Deadlock!

To fix this, at least one stream (or better, both) should be read asynchronously.

See the second example in MSDN, which talks specifically about this scenario and how to switch to asynchronous mode.
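As a minimal sketch of that fix (not the MSDN sample itself), the method below reads StdErr asynchronously while StdOut is read synchronously, so neither pipe can fill up unread. The names `ProcessRunner` and `Run` are hypothetical, and the UTF-8 settings are carried over from the question; they are only correct if the child really emits UTF-8.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Threading.Tasks;

static class ProcessRunner
{
    // Hypothetical helper: runs a command and returns stdout,
    // or stderr when stdout is empty (same fallback as the question's code).
    public static string Run(string command, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = command,
            Arguments = arguments,
            UseShellExecute = false,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            CreateNoWindow = true, // suppress the console window
            // Assumption taken from the question: the child writes UTF-8.
            StandardOutputEncoding = Encoding.UTF8,
            StandardErrorEncoding = Encoding.UTF8
        };

        using (Process proc = Process.Start(startInfo))
        {
            // Start draining stderr in the background first, so the child
            // never blocks on a full stderr pipe while we read stdout.
            Task<string> errorTask = proc.StandardError.ReadToEndAsync();
            string output = proc.StandardOutput.ReadToEnd();
            string error = errorTask.Result;
            proc.WaitForExit();

            return string.IsNullOrEmpty(output) && !string.IsNullOrEmpty(error)
                ? error
                : output;
        }
    }
}
```

Calling `ProcessRunner.Run(command, arguments)` would then stand in for `processVB(command, arguments)`. An event-based variant using `BeginErrorReadLine` and the `ErrorDataReceived` event should work equally well.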

As for the UTF-8 issue, are you sure that your child process is outputting in that encoding and not, say, UTF-16 or some other one? You may want to examine the raw bytes to work out which encoding the stream actually uses, so that you can set the proper encoding for interpreting the redirected stream.
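One way to examine the bytes, sketched here under the assumption that only StdOut matters, is to copy the redirected stream's `BaseStream` into memory so that no decoding is applied. `RawOutputCapture` and `CaptureStdOutBytes` are hypothetical names.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class RawOutputCapture
{
    // Hypothetical helper: returns the child's stdout as undecoded bytes.
    public static byte[] CaptureStdOutBytes(string command, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = command,
            Arguments = arguments,
            UseShellExecute = false,
            RedirectStandardOutput = true, // stderr is left alone, so it cannot fill a pipe
            CreateNoWindow = true
        };

        using (Process proc = Process.Start(startInfo))
        using (var buffer = new MemoryStream())
        {
            // BaseStream bypasses the StreamReader, so no encoding is involved.
            proc.StandardOutput.BaseStream.CopyTo(buffer);
            proc.WaitForExit();
            return buffer.ToArray();
        }
    }
}
```

Printing the result with `Console.WriteLine(BitConverter.ToString(bytes))` gives a hex dump to compare against known encodings. For the text "äåéö€", UTF-8 would be C3-A4-C3-A5-C3-A9-C3-B6-E2-82-AC, UTF-16LE would be E4-00-E5-00-E9-00-F6-00-AC-20, and Windows-1252 would be E4-E5-E9-F6-80. The output in the question (four replacement characters followed by a literal "?") would also be consistent with a single-byte OEM code page such as 437 or 850, where the four letters are 84-86-82-94 and € has no mapping, but that remains a guess until the actual bytes are inspected.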

EDIT

Here is how I think you can resolve the encoding issue. The basic idea comes from something I once needed to do: I had Russian text in an unknown encoding and needed to figure out how to convert it so that it showed the proper characters. Take the bytes captured from StdOut and try to decode them using every code page available on the system. The one that looks right is likely (but not necessarily) the encoding that StdOut uses. It is not guaranteed to be the one, even if it looks correct with your data, because many encodings overlap over some ranges of bytes and decode them identically. For example, ASCII and UTF-8 produce the same bytes when encoding basic Latin characters. So to get an exact match, you may need to get creative and test with some atypical text.

Here is the basic code to try each encoding; adjustments may be necessary:
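The overlap between ASCII and UTF-8 is easy to check directly. The following is a small illustration only; `OverlapDemo` and `SameBytes` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text;

static class OverlapDemo
{
    // True when both encodings produce identical bytes for the given text.
    public static bool SameBytes(string text, Encoding a, Encoding b)
    {
        return a.GetBytes(text).SequenceEqual(b.GetBytes(text));
    }

    public static void Main()
    {
        // Basic Latin only: ASCII and UTF-8 are indistinguishable here.
        Console.WriteLine(SameBytes("hello", Encoding.ASCII, Encoding.UTF8));

        // "Atypical" text: the two encodings now disagree.
        Console.WriteLine(SameBytes("äåéö€", Encoding.ASCII, Encoding.UTF8));
    }
}
```

The first call prints True and the second False, which is why test data containing characters such as ä or € narrows the candidates much more than plain English text does.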

    // Raw bytes captured from the child's StandardOutput (placeholder: fill in before compiling).
    byte[] bytes = <put here bytes captured from StandardOut of child process>;

    foreach (System.Text.EncodingInfo encodingInfo in System.Text.Encoding.GetEncodings())
    {
        // Decode the same bytes with every encoding the system knows about.
        System.Text.Encoding encoding = encodingInfo.GetEncoding();
        string decodedText = encoding.GetString(bytes);
        System.Console.Out.WriteLine("Encoding: {0}, Decoded Bytes: {1}", encoding.EncodingName, decodedText);
    }

Run the code and manually examine the output. All those that match the expected text are candidates for being the encoding used in StdOut.
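When the expected text is known in advance, as it is in the question, the manual inspection can be narrowed down by comparing each decoded string against it. This is a sketch under that assumption; `EncodingGuesser` and `FindCandidates` are hypothetical names, and the sample bytes in `Main` are a stand-in for real captured output.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class EncodingGuesser
{
    // Hypothetical helper: names of all encodings that decode
    // `bytes` to exactly `expected`.
    public static List<string> FindCandidates(byte[] bytes, string expected)
    {
        var candidates = new List<string>();
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            try
            {
                if (info.GetEncoding().GetString(bytes) == expected)
                {
                    candidates.Add(info.Name);
                }
            }
            catch (Exception ex) when (ex is NotSupportedException || ex is ArgumentException)
            {
                // Encoding is listed but not usable on this system; skip it.
            }
        }
        return candidates;
    }

    public static void Main()
    {
        const string expected = "äåéö€";

        // Stand-in for captured output: the expected text encoded as UTF-16LE.
        byte[] sample = Encoding.Unicode.GetBytes(expected);

        foreach (string name in FindCandidates(sample, expected))
        {
            Console.WriteLine(name);
        }
    }
}
```

With the stand-in sample, the list should include at least utf-16. More than one match means the test text is not distinctive enough to separate those encodings.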

LB2 answered Sep 24 '22 11:09