I have an application that uses a large number of strings, so I have a memory usage problem. I know that one of the best solutions in this case is to use a database, but I cannot do that at the moment, so I am looking for other solutions.
In C#, strings are stored as UTF-16, which means that most of my strings take twice as much memory as they would in UTF-8. So I decided to store them as byte arrays of UTF-8 data. To my surprise, this solution took twice as much memory as plain strings in my application.
So I ran some simple tests, but I would like the opinion of experts to be sure.
Test 1: Fixed-length string allocation
var stringArray = new string[10000];
var byteArray = new byte[10000][];
var Sb = new StringBuilder();
var utf8 = Encoding.UTF8;
var stringGen = new Random(561651);
for (int i = 0; i < 10000; i++) {
    for (int j = 0; j < 10000; j++) {
        Sb.Append((stringGen.Next(90)+32).ToString());
    }
    stringArray[i] = Sb.ToString();
    byteArray[i] = utf8.GetBytes(Sb.ToString());
    Sb.Clear();
}
GC.Collect();
GC.WaitForFullGCComplete(5000);
Memory Usage
00007ffac200a510        1        80032 System.Byte[][]
00007ffac1fd02b8       56       152400 System.Object[]
000000bf7655fcf0      303      3933750      Free
00007ffac1fd5738    10004    224695091 System.Byte[]
00007ffac1fcfc40    10476    449178396 System.String
As we can see, the byte arrays take half as much memory as the strings. No real surprise here.
Test 2: Random-size string allocation (with realistic lengths)
var stringArray = new string[10000];
var byteArray = new byte[10000][];
var Sb = new StringBuilder();
var utf8 = Encoding.UTF8;
var lengthGen = new Random(2138784);
for (int i = 0; i < 10000; i++) {
    for (int j = 0; j < lengthGen.Next(100); j++) {
        Sb.Append(i.ToString());
        stringArray[i] = Sb.ToString();
        byteArray[i] = utf8.GetBytes(Sb.ToString());
    }
    Sb.Clear();
}
GC.Collect();
GC.WaitForFullGCComplete(5000);
Memory Usage
00007ffac200a510        1        80032 System.Byte[][]
000000be2aa8fd40       12        82784      Free
00007ffac1fd02b8       56       152400 System.Object[]
00007ffac1fd5738     9896       682260 System.Byte[]
00007ffac1fcfc40    10368      1155110 System.String
The strings take slightly less than twice the memory of the byte arrays. With shorter strings I was expecting a greater overhead for strings, but the opposite seems to be true. Why?
Test 3: String model corresponding to my application
var stringArray = new string[10000];
var byteArray = new byte[10000][];
var Sb = new StringBuilder();
var utf8 = Encoding.UTF8;
var lengthGen = new Random();
for (int i=0; i < 10000; i++) {
    if (i%2 == 0) {
        for (int j = 0; j < lengthGen.Next(100000); j++) {
            Sb.Append(i.ToString());
            stringArray[i] = Sb.ToString();
            byteArray[i] = utf8.GetBytes(Sb.ToString());
            Sb.Clear();
        }
    } else {
        stringArray[i] = Sb.ToString();
        byteArray[i] = utf8.GetBytes(Sb.ToString());
        Sb.Clear();
    }
}
GC.Collect();
GC.WaitForFullGCComplete(5000);
Memory Usage
00007ffac200a510        1        80032 System.Byte[][]
00007ffac1fd02b8       56       152400 System.Object[]
00007ffac1fcfc40     5476       198364 System.String
00007ffac1fd5738    10004       270075 System.Byte[]
Here the strings take much less memory than the byte arrays. This may be surprising, but I suppose that the empty string is referenced only once. Is that right? I don't know whether this can explain the whole difference, though. Is there any other reason? What is the best solution?
This may be surprising, but I suppose that the empty string is referenced only once.
Yes, an empty StringBuilder returns string.Empty as its result. The code snippet below prints True:
var sb = new StringBuilder();
Console.WriteLine(object.ReferenceEquals(sb.ToString(), string.Empty));
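I don't know whether this can explain the whole difference, though.
Yes, this perfectly explains it. You are saving on 5,000 string objects, because every empty string is the same shared instance, while every empty byte array is still a separate object. The difference in bytes is roughly 270,000 - (198,000 / 2), so about 170 kB. Dividing by 5,000 gives about 34 bytes per object, which is in the range of the fixed per-object overhead of a small array (object header, type pointer and length field) rather than of any payload.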
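To sanity-check the per-object figure on your own runtime, you could measure what a zero-length byte array costs. This is only a rough sketch: the class and method names are made up for illustration, and it assumes that GC.GetTotalMemory(true) is precise enough once the cost is averaged over many allocations. The result depends on the runtime and on whether the process is 32-bit or 64-bit, so no expected value is shown.

```csharp
using System;

static class EmptyArrayCost
{
    // Hypothetical helper: estimates the average heap cost of one
    // zero-length byte[] by comparing heap size before and after
    // allocating 'count' of them.
    public static double EstimateBytesPerEmptyArray(int count)
    {
        // Allocate the holder first so it is not counted in the difference.
        var holder = new byte[count][];
        long before = GC.GetTotalMemory(true);

        for (int i = 0; i < count; i++)
        {
            holder[i] = new byte[0];   // a distinct object on every iteration
        }

        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(holder);          // keep the arrays alive until measured
        return (after - before) / (double)count;
    }

    static void Main()
    {
        Console.WriteLine(EstimateBytesPerEmptyArray(100000));
    }
}
```

Multiplying the printed value by the number of empty entries in your data gives an estimate of how much memory the separate empty arrays cost.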
What is the best solution?
Do the same thing: make yourself a private static readonly empty array, and use it each time that you get string.Empty from sb.ToString():
private static readonly byte[] EmptyBytes = new byte[0];
...
else
{
    stringArray[i] = Sb.ToString();
    if (stringArray[i] == string.Empty) {
        byteArray[i] = EmptyBytes;
    } else {
        byteArray[i] = utf8.GetBytes(Sb.ToString());
    }
    Sb.Clear();
}
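If the conversion happens in more than one place, the same idea can be wrapped in a small helper. The following is a minimal sketch, not part of the original code: Utf8Store and Encode are hypothetical names, and it assumes that null is never passed in (Encoding.UTF8.GetBytes throws ArgumentNullException for null).

```csharp
using System;
using System.Text;

static class Utf8Store
{
    // One shared instance used for every empty string.
    private static readonly byte[] EmptyBytes = new byte[0];

    // Hypothetical helper: encodes a string as UTF-8 and reuses the
    // shared EmptyBytes array whenever the input is empty.
    public static byte[] Encode(string value)
    {
        return value.Length == 0 ? EmptyBytes : Encoding.UTF8.GetBytes(value);
    }

    static void Main()
    {
        byte[] a = Encode("");
        byte[] b = Encode(new StringBuilder().ToString());
        Console.WriteLine(object.ReferenceEquals(a, b)); // True: both are EmptyBytes
        Console.WriteLine(Encode("abc").Length);         // 3
    }
}
```

On .NET Framework 4.6 and later, Array.Empty<byte>() returns a shared empty array as well, so it could stand in for the private field.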