In C, I'm working on a "class" that manages a byte buffer, allowing arbitrary data to be appended to the end. I'm now looking into automatic resizing, using calls to realloc as the underlying array fills up. This should make sense to anyone who's ever used the Java or C# StringBuilder. I understand how to go about the resizing. But does anyone have any suggestions, with rationale provided, on how much to grow the buffer with each resize?
Obviously, there's a trade-off to be made between wasted space and excessive realloc calls (which could lead to excessive copying). I've seen some tutorials/articles that suggest doubling. That seems wasteful if the user manages to supply a good initial guess. Is it worth trying to round to some power of two or a multiple of the alignment size on a platform?
Does anyone know what Java or C# does under the hood?
In C# the strategy used to grow the internal buffer used by a StringBuilder has changed over time.
There are three basic strategies for solving this problem, and they have different performance characteristics.
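The first basic strategy is: when the buffer gets full, allocate a new buffer that is a fixed amount k larger, and copy the old contents into it.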
This strategy has a number of problems, the most obvious of which is that it is O(n^2) in time if the string being built is extremely large. Let's say that k is a thousand characters and the final string is a million characters. You end up reallocating the string at 1000, 2000, 3000, 4000, ... and therefore copying 1000 + 2000 + 3000 + 4000 + ... + 999000 characters, which sums to on the order of 500 million characters copied!
This strategy has the nice property that the amount of "wasted" memory is bounded by k.
In practice this strategy is seldom used because of that n-squared problem.
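To put a number on the fixed-increment case, here is a small sketch that models each reallocation as a copy of the whole old buffer. The function name bytes_copied_fixed is made up for illustration, and the model ignores the fact that a real realloc can sometimes extend a block in place.

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes copied when a buffer grows by a fixed increment k
 * until its capacity reaches final_size. Every reallocation is
 * modelled as copying the entire old capacity. */
unsigned long long bytes_copied_fixed(size_t k, size_t final_size)
{
    assert(k > 0);
    unsigned long long copied = 0;
    for (size_t cap = k; cap < final_size; cap += k)
        copied += cap;   /* old contents are moved to the new block */
    return copied;
}
```

With k = 1000 and a final size of 1000000 this returns 499500000, roughly 500 times the size of the final string.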
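The second basic strategy is: when the buffer gets full, allocate a new buffer that is k% larger, and copy the old contents into it.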
k% is usually 100%; if it is then this is called the "double when full" strategy.
This strategy has the nice property that its amortized cost is O(n). Suppose again the final string is a million characters and you start with a thousand. You make copies at 1000, 2000, 4000, 8000, ... and end up copying 1000 + 2000 + 4000 + 8000 ... + 512000 characters, which sums to about a million characters copied; much better.
The strategy has the property that the amortized cost is linear no matter what percentage you choose.
This strategy has the downside that a copy operation is sometimes extremely expensive, and you can be wasting up to k% of the final string length in unused memory.
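For the C buffer in the question, a double-when-full append could look roughly like the sketch below. The bytebuf type, the function name and the starting capacity of 16 are all assumptions made for the example, not anything taken from .NET or Java.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
} bytebuf;

/* Append n bytes, doubling the capacity until the data fits.
 * Returns 0 on success, -1 on size overflow or allocation failure;
 * on failure the buffer is left unchanged. */
int bytebuf_append(bytebuf *b, const void *src, size_t n)
{
    assert(b != NULL);
    if (n > SIZE_MAX - b->len)
        return -1;
    size_t need = b->len + n;
    if (need > b->cap) {
        size_t cap = b->cap ? b->cap : 16;   /* arbitrary starting size */
        while (cap < need) {
            if (cap > SIZE_MAX / 2) {        /* doubling would overflow */
                cap = need;
                break;
            }
            cap *= 2;
        }
        unsigned char *p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    if (n > 0)
        memcpy(b->data + b->len, src, n);
    b->len = need;
    return 0;
}
```

A zero-initialised bytebuf (bytebuf b = {0};) is a valid empty buffer, and the caller releases it with free(b.data). Swapping cap *= 2 for another multiplier changes the constant factor but keeps the amortized cost linear.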
The third strategy is to make a linked list of arrays, each array of size k. When you overflow an existing array, a new one is allocated and appended to the end of the list.
This strategy has the nice property that no operation is particularly expensive, the total wasted memory is bounded by k, and you don't need to be able to locate large blocks in the heap on a regular basis. It has the downside that finally turning the thing into a string can be expensive as the arrays in the linked list might have poor locality.
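A minimal sketch of the linked-list-of-blocks idea in C, assuming a fixed block size of 4096 bytes (an arbitrary choice) and hypothetical names chunkbuf_append, chunkbuf_flatten and chunkbuf_free. It is only meant to show the shape of the technique; the real .NET StringBuilder differs in its details.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 4096   /* the "k" of the text; value chosen arbitrarily */

typedef struct chunk {
    struct chunk *next;
    size_t used;
    unsigned char bytes[CHUNK_SIZE];
} chunk;

typedef struct {
    chunk *head, *tail;
    size_t total;          /* bytes stored across all chunks */
} chunkbuf;

/* Append n bytes, allocating a fresh chunk whenever the tail is full.
 * Existing data is never moved. Returns 0 on success, -1 if malloc
 * fails (bytes appended before the failure stay in the buffer). */
int chunkbuf_append(chunkbuf *cb, const void *src, size_t n)
{
    assert(cb != NULL);
    const unsigned char *p = src;
    while (n > 0) {
        if (cb->tail == NULL || cb->tail->used == CHUNK_SIZE) {
            chunk *c = malloc(sizeof *c);
            if (c == NULL)
                return -1;
            c->next = NULL;
            c->used = 0;
            if (cb->tail != NULL)
                cb->tail->next = c;
            else
                cb->head = c;
            cb->tail = c;
        }
        size_t room = CHUNK_SIZE - cb->tail->used;
        size_t take = n < room ? n : room;
        memcpy(cb->tail->bytes + cb->tail->used, p, take);
        cb->tail->used += take;
        cb->total += take;
        p += take;
        n -= take;
    }
    return 0;
}

/* Copy everything into one contiguous malloc'd block (caller frees).
 * This is the potentially expensive final step mentioned above. */
unsigned char *chunkbuf_flatten(const chunkbuf *cb)
{
    unsigned char *out = malloc(cb->total ? cb->total : 1);
    if (out == NULL)
        return NULL;
    size_t off = 0;
    for (const chunk *c = cb->head; c != NULL; c = c->next) {
        memcpy(out + off, c->bytes, c->used);
        off += c->used;
    }
    return out;
}

/* Release every chunk and reset the buffer to empty. */
void chunkbuf_free(chunkbuf *cb)
{
    chunk *c = cb->head;
    while (c != NULL) {
        chunk *next = c->next;
        free(c);
        c = next;
    }
    cb->head = cb->tail = NULL;
    cb->total = 0;
}
```

Appending never copies old data, so the worst-case cost of one append is bounded by the amount being appended plus one malloc; the price is paid in chunkbuf_flatten, which has to walk the whole list.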
The string builder in the .NET framework used to use a double-when-full strategy; it now uses a linked-list-of-blocks strategy.
You generally want to keep the growth factor a little smaller than the golden mean (~1.6). When it's smaller than the golden mean, the discarded segments will be large enough to satisfy a later request, as long as they're adjacent to each other. If your growth factor is larger than the golden mean, that can't happen.
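The argument can be checked with a toy model. The sketch below assumes an idealised allocator in which every new block is placed directly after the previous one and all freed blocks merge into a single hole; real allocators are messier, so treat it as an illustration of the arithmetic only. The name first_reuse_step is made up for the example.

```c
#include <assert.h>

/* Idealised model: blocks are laid out back to back, and freed blocks
 * coalesce into one hole in front of the live block.
 * Returns the number of the first resize whose request fits into that
 * hole, or -1 if none does within max_steps resizes. */
int first_reuse_step(double factor, int max_steps)
{
    assert(factor > 1.0);
    double live = 1.0;    /* size of the block currently in use */
    double freed = 0.0;   /* combined size of all earlier, freed blocks */
    for (int step = 1; step <= max_steps; step++) {
        double next = live * factor;
        if (freed >= next)
            return step;  /* the new block fits in the old space */
        freed += live;    /* old block is released after the copy */
        live = next;
    }
    return -1;
}
```

In this model a factor of 1.5 can reuse the freed space on the fifth resize, while a factor of 2.0 never can, because the freed blocks always add up to slightly less than the block currently in use.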
I've found that reducing the factor to 1.5 still works quite nicely, and has the advantage of being easy to implement in integer math: size = (size + (size << 1)) >> 1; (with a decent compiler you can write that as (size * 3) / 2, and it should still compile to fast code).
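One caveat with the shift-and-add form is that size * 3 can overflow for very large sizes. A hedged alternative, written as size + size / 2 (which gives the same result as (size * 3) / 2 whenever the latter does not overflow), might look like this; the function name grow_1_5 and the saturating behaviour are choices made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Next capacity for a growth factor of 1.5.
 * Saturates at SIZE_MAX instead of wrapping around, and always
 * grows by at least one so that capacities 0 and 1 make progress. */
size_t grow_1_5(size_t cap)
{
    size_t half = cap / 2;
    if (cap > SIZE_MAX - half)
        return SIZE_MAX;
    size_t next = cap + half;
    if (next == cap)          /* happens for cap == 0 and cap == 1 */
        next = cap + 1;
    assert(next > cap);       /* invariant: the result is strictly larger */
    return next;
}
```

For example, grow_1_5(16) gives 24 and grow_1_5(1000) gives 1500.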
I seem to recall a conversation some years ago on Usenet in which P.J. Plauger (or maybe it was Pete Becker) of Dinkumware said they'd run rather more extensive tests than I ever did and reached the same conclusion (so, for example, the implementation of std::vector in their C++ standard library uses 1.5).
When working with expanding and contracting buffers, the key property you want is to grow or shrink by a multiple of your size, not a constant difference.
Consider a 16-byte array: increasing its size by 128 bytes is overkill. If instead you had a 4096-byte array and increased it by only 128 bytes, you would end up copying a lot.
I was taught to always double or halve arrays. If you really have no hint as to the size or maximum, multiplying by two ensures that you have a lot of capacity for a long time, and unless you're working on a resource constrained system, allocating at most twice the space isn't too terrible. Additionally, keeping things in powers of two can let you use bit shifts and other tricks and the underlying allocation is usually in powers of two.
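Two small helpers can illustrate the power-of-two bookkeeping. Both names are hypothetical, and the shrink rule (halve only once the buffer is at most a quarter full) is an assumption added here to avoid bouncing between two sizes; the text above only says "double or halve".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two that is >= n (returns 1 for n == 0).
 * Returns 0 if no such power of two fits in size_t. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) {
        if (p > SIZE_MAX / 2)
            return 0;
        p <<= 1;
    }
    return p;
}

/* Capacity to shrink to: half the current capacity, but only when at
 * most a quarter of it is in use. Otherwise the capacity is kept. */
size_t shrink_target(size_t len, size_t cap)
{
    assert(len <= cap);
    return (cap >= 2 && len <= cap / 4) ? cap / 2 : cap;
}
```

Rounding a user-supplied initial size with round_up_pow2 keeps every later doubling on a power of two, and the quarter-full threshold leaves the halved buffer no more than half full, so the very next append does not force it to grow again.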
Does anyone know what Java or C# does under the hood?
Have a look at the following link to see how it's done in Java's StringBuilder from JDK11, in particular, the ensureCapacityInternal method. https://java-browser.yawk.at/java/11/java.base/java/lang/AbstractStringBuilder.java#java.lang.AbstractStringBuilder%23ensureCapacityInternal%28int%29
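As I read the JDK 11 source, ensureCapacityInternal delegates to a newCapacity helper that tries twice the old capacity plus two and falls back to the requested minimum if that is still too small. The C sketch below restates that rule only; the function name is invented, and the JDK's own overflow and maximum-array-size handling is not reproduced, so check the linked source before relying on the details.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Growth rule in the style of JDK 11's AbstractStringBuilder:
 * new capacity = old * 2 + 2, or min_cap if that is not enough.
 * If old * 2 + 2 would overflow size_t, min_cap is used directly
 * (this overflow policy is an assumption of the sketch). */
size_t jdk11_style_capacity(size_t old_cap, size_t min_cap)
{
    assert(min_cap > old_cap);   /* only called when growth is needed */
    if (old_cap > (SIZE_MAX - 2) / 2)
        return min_cap;
    size_t grown = old_cap * 2 + 2;
    return grown < min_cap ? min_cap : grown;
}
```

So a builder with capacity 16 that needs room for 17 characters would grow to 34, while one that suddenly needs 100 would jump straight to 100.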