How much to grow a buffer in a StringBuilder-like C module?

In C, I'm working on a "class" that manages a byte buffer, allowing arbitrary data to be appended to the end. I'm now looking into automatically resizing the underlying array with calls to realloc as it fills up. This should make sense to anyone who's ever used Java's or C#'s StringBuilder. I understand how to go about the resizing itself. But does anyone have any suggestions, with rationale, on how much to grow the buffer with each resize?

Obviously, there's a trade-off to be made between wasted space and excessive realloc calls (which could lead to excessive copying). I've seen some tutorials/articles that suggest doubling, but that seems wasteful if the user manages to supply a good initial guess. Is it worth trying to round to some power of two, or to a multiple of the platform's alignment size?
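For reference, a minimal sketch of the kind of append-with-realloc code in question (the struct and function names are hypothetical, and the doubling loop is just a placeholder for whatever growth policy turns out to be best):

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical byte-buffer "class". */
    struct byte_buf {
        unsigned char *data;
        size_t         len;   /* bytes currently stored    */
        size_t         cap;   /* bytes currently allocated */
    };

    static int byte_buf_append(struct byte_buf *b, const void *src, size_t n)
    {
        if (b->len + n > b->cap) {
            size_t want = b->len + n;
            size_t cap  = b->cap ? b->cap : 64;   /* initial guess */
            while (cap < want)
                cap *= 2;                         /* <-- the growth policy in question */
            unsigned char *p = realloc(b->data, cap);
            if (p == NULL)
                return -1;                        /* old buffer is still valid */
            b->data = p;
            b->cap  = cap;
        }
        memcpy(b->data + b->len, src, n);
        b->len += n;
        return 0;
    }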

Does anyone know what Java or C# does under the hood?

asked Apr 17 '12 by Brian McFarland

4 Answers

In C#, the strategy used to grow the internal buffer of a StringBuilder has changed over time.

There are three basic strategies for solving this problem, and they have different performance characteristics.

The first basic strategy is:

  • Make an array of characters
  • When you run out of room, create a new array with k more characters, for some constant k.
  • Copy the old array to the new array, and orphan the old array.

This strategy has a number of problems, the most obvious of which is that it is O(n²) in time if the string being built is extremely large. Let's say that k is a thousand characters and the final string is a million characters. You end up reallocating the string at 1000, 2000, 3000, 4000, ... and therefore copying 1000 + 2000 + 3000 + 4000 + ... + 999000 characters, which sums to on the order of 500 million characters copied!

This strategy has the nice property that the amount of "wasted" memory is bounded by k.

In practice this strategy is seldom used because of that n-squared problem.
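For concreteness, the capacity calculation for this first strategy might look something like the following sketch (the names and the value of k are illustrative, not taken from any real implementation):

    /* Strategy 1: grow by a fixed number of characters (illustrative sketch). */
    #define GROW_STEP 1000   /* the constant k */

    static size_t grow_by_constant(size_t needed)
    {
        /* Round the required capacity up to the next multiple of GROW_STEP.
           Each time a boundary is crossed, a realloc and a full copy occur,
           so building an n-character string costs O(n^2) copying overall. */
        return ((needed + GROW_STEP - 1) / GROW_STEP) * GROW_STEP;
    }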

The second basic strategy is

  • Make an array
  • When you run out of room, create a new array with k% more characters, for some constant k.
  • Copy the old array to the new array, and orphan the old array.

k% is usually 100%; when it is, this is called the "double when full" strategy.

This strategy has the nice property that its amortized cost is O(n). Suppose again the final string is a million characters and you start with a thousand. You make copies at 1000, 2000, 4000, 8000, ... and end up copying 1000 + 2000 + 4000 + 8000 ... + 512000 characters, which sums to about a million characters copied; much better.

The strategy has the property that the amortized cost is linear no matter what percentage you choose.

This strategy has a couple of downsides: sometimes a single copy operation is extremely expensive, and you can be wasting up to k% of the final string length in unused memory.
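A corresponding sketch of the grow-by-percentage strategy, with k% = 100% (again, the names are illustrative):

    /* Strategy 2: grow geometrically, here by 100% (illustrative sketch). */
    static size_t grow_geometric(size_t current, size_t needed)
    {
        size_t cap = current ? current : 1000;   /* initial capacity */
        /* Doubling means each character is copied O(1) times on average,
           so total copying while building an n-character string is O(n). */
        while (cap < needed)
            cap *= 2;
        return cap;
    }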

The third strategy is to make a linked list of arrays, each array of size k. When you overflow an existing array, a new one is allocated and appended to the end of the list.

This strategy has the nice property that no operation is particularly expensive, the total wasted memory is bounded by k, and you don't need to be able to locate large blocks in the heap on a regular basis. It has the downside that finally turning the thing into a string can be expensive as the arrays in the linked list might have poor locality.
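A sketch of the linked-list-of-blocks approach, with made-up names and an arbitrary block size:

    #include <stdlib.h>
    #include <string.h>

    /* Strategy 3: a chain of fixed-size blocks (illustrative sketch). */
    #define BLOCK_SIZE 4096

    struct block {
        struct block *next;
        size_t        used;
        unsigned char data[BLOCK_SIZE];
    };

    struct block_builder {
        struct block *head;   /* first block in the chain            */
        struct block *tail;   /* block currently being appended into */
        size_t        total;  /* total bytes across all blocks       */
    };

    /* Appending never copies existing data; when the current block is
       full, a fresh block is linked onto the end of the chain. */
    static int bb_append(struct block_builder *bb, const void *src, size_t len)
    {
        const unsigned char *p = src;
        while (len > 0) {
            if (bb->tail == NULL || bb->tail->used == BLOCK_SIZE) {
                struct block *blk = calloc(1, sizeof *blk);
                if (blk == NULL)
                    return -1;
                if (bb->tail != NULL)
                    bb->tail->next = blk;
                else
                    bb->head = blk;
                bb->tail = blk;
            }
            size_t room = BLOCK_SIZE - bb->tail->used;
            size_t n    = len < room ? len : room;
            memcpy(bb->tail->data + bb->tail->used, p, n);
            bb->tail->used += n;
            bb->total      += n;
            p   += n;
            len -= n;
        }
        return 0;
    }

The only step that has to touch every block is the final conversion to a contiguous string, which is where the locality cost mentioned above shows up.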

The string builder in the .NET framework used to use a double-when-full strategy; it now uses a linked-list-of-blocks strategy.

answered by Eric Lippert


You generally want to keep the growth factor a little smaller than the golden mean (≈1.618). When the factor is smaller than that, the segments you've already discarded will, as long as they're adjacent to each other, eventually add up to enough space to satisfy a later request. If your growth factor is larger than the golden mean, that can't happen.

I've found that reducing the factor to 1.5 still works quite nicely, and it has the advantage of being easy to implement in integer math (size = (size + (size << 1)) >> 1; -- with a decent compiler you can write that as (size * 3) / 2, and it should still compile to fast code).
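Applied to a realloc-style buffer, that might look like the following sketch (the minimum size of 16 is my own addition, to keep the integer math from stalling at very small sizes):

    /* Grow by roughly 1.5x until the buffer can hold `needed` bytes. */
    static size_t grow_1_5x(size_t size, size_t needed)
    {
        if (size < 16)
            size = 16;
        while (size < needed)
            size = (size + (size << 1)) >> 1;   /* size * 3 / 2 in integer math */
        return size;
    }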

I seem to recall a conversation on Usenet some years ago in which P.J. Plauger (or maybe it was Pete Becker) of Dinkumware said they'd run rather more extensive tests than I ever did and reached the same conclusion (so, for example, the implementation of std::vector in their C++ standard library uses 1.5).

answered by Jerry Coffin


When working with expanding and contracting buffers, the key property you want is to grow or shrink by a multiple of the current size, not by a constant amount.

Consider the case where you have a 16-byte array: increasing its size by 128 bytes is overkill. However, if instead you had a 4096-byte array and grew it by only 128 bytes at a time, you would end up copying a lot.

I was taught to always double or halve arrays. If you really have no hint as to the size or maximum, multiplying by two ensures that you have plenty of capacity for a long time, and unless you're working on a resource-constrained system, allocating at most twice the needed space isn't too terrible. Additionally, keeping sizes in powers of two lets you use bit shifts and other tricks, and the underlying allocation is usually in powers of two anyway.
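If you go the power-of-two route, the rounding helper is short; this is just a sketch and doesn't guard against overflow:

    /* Round a requested capacity up to the next power of two
       (illustrative sketch; does not handle overflow of size_t). */
    static size_t next_pow2(size_t n)
    {
        size_t cap = 1;
        while (cap < n)
            cap <<= 1;   /* doubling keeps the capacity a power of two */
        return cap;
    }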

answered by Michael


"Does anyone know what Java or C# does under the hood?"

Have a look at the following link to see how it's done in Java's StringBuilder as of JDK 11, in particular the ensureCapacityInternal method: https://java-browser.yawk.at/java/11/java.base/java/lang/AbstractStringBuilder.java#java.lang.AbstractStringBuilder%23ensureCapacityInternal%28int%29

answered by Rich Drummond