
compression and utf8 encoding

Tags: c#, encoding, utf-8

Can someone tell me why I'm losing information in this process? Some characters come back still escaped, for example "Biography":"\u003clink type=... or Steve Blunt \u0026 Marty Kelley, while others are decoded correctly: "Name":"朱敬

// Creating a Base64 string containing gzip data
string bar;
using (MemoryStream ms = new MemoryStream())
{
    using (GZipStream gzip = new GZipStream(ms, CompressionMode.Compress))
    using (StreamWriter writer = new StreamWriter(gzip, System.Text.Encoding.UTF8))
    {
        writer.Write(s);
    }
    ms.Flush();
    bar = Convert.ToBase64String(ms.ToArray());
}

// Reading it
string foo;
byte[] itemData = Convert.FromBase64String(bar);
using (MemoryStream src = new MemoryStream(itemData))
using (GZipStream gzs = new GZipStream(src, CompressionMode.Decompress))
using (MemoryStream dest = new MemoryStream(itemData.Length*2))
{
    gzs.CopyTo(dest);
    foo = Encoding.UTF8.GetString(dest.ToArray());
}

Console.WriteLine(foo);
deKajoo asked May 28 '14 09:05



1 Answer

This may be because the write and read sides are not symmetric: you write the string with a StreamWriter but read it back with CopyTo() and Encoding.GetString().

What happens if you try this?

// Reading it
string foo;
byte[] itemData = Convert.FromBase64String(bar);
using (MemoryStream src = new MemoryStream(itemData))
using (GZipStream gzs = new GZipStream(src, CompressionMode.Decompress))
using (StreamReader reader = new StreamReader(gzs, Encoding.UTF8))
{
    foo = reader.ReadToEnd();
}
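
To make the asymmetry concrete, here is a minimal sketch that compresses with a StreamWriter and then decodes both ways. The class and method names (GzipText, Compress, DecompressWithGetString, DecompressWithReader) are made up for illustration. One difference it should expose: a StreamWriter created with Encoding.UTF8 writes a byte order mark, which Encoding.UTF8.GetString() keeps as a leading U+FEFF character, while a StreamReader strips it. This does not by itself explain the \u003c sequences, so treat it as a way to narrow the problem down rather than a confirmed diagnosis.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of the original question or answer.
public static class GzipText
{
    // Compress text to a Base64 string, the same way the question does.
    public static string Compress(string text)
    {
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress))
            using (var writer = new StreamWriter(gzip, Encoding.UTF8))
            {
                writer.Write(text);
            }
            // ToArray() is still valid after the MemoryStream has been closed.
            return Convert.ToBase64String(ms.ToArray());
        }
    }

    // Decode the raw bytes directly: any byte order mark stays in the string.
    public static string DecompressWithGetString(string base64)
    {
        using (var src = new MemoryStream(Convert.FromBase64String(base64)))
        using (var gzs = new GZipStream(src, CompressionMode.Decompress))
        using (var dest = new MemoryStream())
        {
            gzs.CopyTo(dest);
            return Encoding.UTF8.GetString(dest.ToArray());
        }
    }

    // Decode through a StreamReader: the byte order mark is consumed,
    // and ReadToEnd() returns every line, not just the first one.
    public static string DecompressWithReader(string base64)
    {
        using (var src = new MemoryStream(Convert.FromBase64String(base64)))
        using (var gzs = new GZipStream(src, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzs, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Comparing the two results for the same input, for example by checking whether the first character of the GetString version is '\uFEFF', shows whether the only difference is the byte order mark. If the \u003c sequences are already present in s before compression, neither reader will change them.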

Although I think you should be using BinaryReader and BinaryWriter, which write and read the string as a single length-prefixed value:

string s = "Biography:\u003clink type...";
string bar;
using (MemoryStream ms = new MemoryStream())
{
    using (GZipStream gzip = new GZipStream(ms, CompressionMode.Compress))
    using (var writer = new BinaryWriter(gzip, Encoding.UTF8))
    {
        writer.Write(s);
    }
    ms.Flush();
    bar = Convert.ToBase64String(ms.ToArray());
}

// Reading it
string foo;
byte[] itemData = Convert.FromBase64String(bar);
using (MemoryStream src = new MemoryStream(itemData))
using (GZipStream gzs = new GZipStream(src, CompressionMode.Decompress))
using (var reader = new BinaryReader(gzs, Encoding.UTF8))
{
    foo = reader.ReadString();
}

Console.WriteLine(foo);
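
As a rough illustration of what the BinaryWriter version puts on the wire, the sketch below leaves out gzip and Base64 and looks only at the bytes. The names BinaryStringDemo, WriteString and ReadString are invented for this example. BinaryWriter.Write(string) emits the encoded byte length first and then the encoded bytes, with no byte order mark, which is why the data has to be read back with BinaryReader.ReadString() rather than decoded as plain text.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class BinaryStringDemo
{
    // Serialize a string with BinaryWriter: length prefix, then UTF-8 bytes.
    public static byte[] WriteString(string text)
    {
        using (var ms = new MemoryStream())
        {
            using (var writer = new BinaryWriter(ms, Encoding.UTF8))
            {
                writer.Write(text);
            }
            return ms.ToArray();
        }
    }

    // Read it back with the matching BinaryReader call.
    public static string ReadString(byte[] data)
    {
        using (var ms = new MemoryStream(data))
        using (var reader = new BinaryReader(ms, Encoding.UTF8))
        {
            return reader.ReadString();
        }
    }
}
```

For the single character "朱", which takes three bytes in UTF-8, WriteString returns four bytes: the length 3 followed by E6 9C B1. Passing those bytes to Encoding.UTF8.GetString() instead of ReadString() would put the length byte into the text, so the writer and reader must always be used as a pair.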
Matthew Watson answered Sep 22 '22 19:09