
Managed memory leaked by C# iterator

I have a class that generates DNA sequences, which are represented as long strings. This class implements the IEnumerable<string> interface, and it can produce an infinite number of DNA sequences. Below is a simplified version of my class:

class DnaGenerator : IEnumerable<string>
{
    private readonly IEnumerable<string> _enumerable;

    public DnaGenerator() => _enumerable = Iterator();

    private IEnumerable<string> Iterator()
    {
        while (true)
            foreach (char c in new char[] { 'A', 'C', 'G', 'T' })
                yield return new String(c, 10_000_000);
    }

    public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

This class generates the DNA sequences by using an iterator. Instead of invoking the iterator again and again, an IEnumerable<string> instance is created during the construction and is cached as a private field. The problem is that using this class results in a sizable chunk of memory being constantly allocated, with the garbage collector being unable to recycle this chunk. Here is a minimal demonstration of this behavior:

var dnaGenerator = new DnaGenerator();
Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
DoWork(dnaGenerator);
GC.Collect();
Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
GC.KeepAlive(dnaGenerator);

static void DoWork(DnaGenerator dnaGenerator)
{
    foreach (string dna in dnaGenerator.Take(5))
    {
        Console.WriteLine($"Processing DNA of {dna.Length:#,0} nucleotides" +
            $", starting from {dna[0]}");
    }
}

Output:

TotalMemory: 84,704 bytes
Processing DNA of 10,000,000 nucleotides, starting from A
Processing DNA of 10,000,000 nucleotides, starting from C
Processing DNA of 10,000,000 nucleotides, starting from G
Processing DNA of 10,000,000 nucleotides, starting from T
Processing DNA of 10,000,000 nucleotides, starting from A
TotalMemory: 20,112,680 bytes

Try it on Fiddle.

My expectation was that all generated DNA sequences would be eligible for garbage collection, since they are not referenced by my program. The only reference that I hold is the reference to the DnaGenerator instance itself, which is not meant to contain any sequences. This component just generates the sequences. Nevertheless, no matter how many or how few sequences my program generates, there are always around 20 MB of memory allocated after a full garbage collection.

My question is: Why is this happening? And how can I prevent this from happening?

.NET 6.0, Windows 10, 64-bit operating system, x64-based processor, Release build.


Update: The problem disappears if I replace this:

public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();

...with this:

public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator();

But I am not a fan of creating a new enumerable each time an enumerator is needed. My understanding is that a single IEnumerable<T> can create many IEnumerator<T>s. AFAIK these two interfaces are not meant to have a one-to-one relationship.

Theodor Zoulias asked Apr 26 '26 03:04


2 Answers

The problem is caused by the auto-generated implementation for the code using yield.

You can mitigate this somewhat by explicitly implementing the enumerator.

You have to tweak it slightly by calling .Reset() from public IEnumerator<string> GetEnumerator(), so that the enumeration restarts on each call:

class DnaGenerator : IEnumerable<string>
{
    // A single enumerator instance is shared by every enumeration.
    private readonly IEnumerator<string> _enumerator;

    public DnaGenerator() => _enumerator = new IteratorImpl();

    sealed class IteratorImpl : IEnumerator<string>
    {
        readonly char[] _data = { 'A', 'C', 'G', 'T' };

        int _index = -1; // Positioned before the first element.

        public bool MoveNext()
        {
            _index = (_index + 1) % _data.Length; // Cycle through A, C, G, T.
            return true; // Infinite sequence.
        }

        public void Reset()
        {
            _index = -1;
        }

        // Built on demand and never stored in a field, so the enumerator
        // does not keep the last sequence alive.
        public string Current => new String(_data[_index], 10_000_000);

        object IEnumerator.Current => Current;

        public void Dispose()
        {
            // Nothing to do.
        }
    }

    public IEnumerator<string> GetEnumerator()
    {
        _enumerator.Reset();
        return _enumerator;
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

Matthew Watson answered Apr 27 '26 21:04


Note that 10,000,000 chars (16 bits each) take approximately 20 MB. If you look at the decompilation, you will notice that yield return results in a generated internal <Iterator> class, which in turn has a current field to store the string (to implement IEnumerator<string>.Current):

[CompilerGenerated]
private sealed class <Iterator>d__2 : IEnumerable<string>, IEnumerable, IEnumerator<string>, IEnumerator, IDisposable
{
    ...
    private string <>2__current;
    ...
}

The Iterator method itself is compiled to something like this:

[IteratorStateMachine(typeof(<Iterator>d__2))]
private IEnumerable<string> Iterator()
{
    return new <Iterator>d__2(-2);
}

This means that, once iteration has started, the enumerator returned by _enumerable.GetEnumerator() always keeps the current string in memory, for as long as the DnaGenerator instance itself is not garbage collected.
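
As a rough sanity check, the numbers in the question's output are consistent with exactly one retained string. The sketch below only redoes that arithmetic; the variable names are made up for illustration, and the small remainder (object headers and other runtime allocations) is not accounted for.

```csharp
using System;

static class Program
{
    static void Main()
    {
        // Characters per generated sequence, as in the question.
        const long charsPerSequence = 10_000_000;

        // A .NET char is a UTF-16 code unit, so sizeof(char) is 2 bytes.
        long payloadBytes = charsPerSequence * sizeof(char); // 20,000,000

        // Difference between the two TotalMemory readings in the question.
        long observedGrowth = 20_112_680 - 84_704; // 20,027,976

        Console.WriteLine($"One sequence payload: {payloadBytes} bytes");
        Console.WriteLine($"Observed growth:      {observedGrowth} bytes");
    }
}
```

The observed growth is only slightly above the payload of a single sequence, which fits the explanation that one string (the last one yielded) stays reachable through the current field.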

Update

"My understanding is that a single IEnumerable<T> can create many IEnumerator<T>s. AFAIK these two interfaces are not meant to have a one-to-one relationship."

Yes, an enumerable generated for yield return can create multiple enumerators, but in this particular case the implementation has a "one-to-one" relationship, because the generated class is both the IEnumerable and the IEnumerator:

private sealed class <Iterator>d__2 : 
    IEnumerable<string>, IEnumerable,
    IEnumerator<string>, IEnumerator, 
    IDisposable
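
One way to observe this is to compare references. The sketch below uses a hypothetical Numbers() iterator (not part of the question) and relies on current C# compiler behavior, which is an implementation detail rather than a documented guarantee: the first GetEnumerator() call on the creating thread is expected to hand back the enumerable object itself, while later calls get a fresh instance.

```csharp
using System;
using System.Collections.Generic;

static class Program
{
    // Hypothetical iterator used only for this demonstration.
    static IEnumerable<int> Numbers()
    {
        yield return 1;
        yield return 2;
    }

    static void Main()
    {
        IEnumerable<int> enumerable = Numbers();

        // First call: the compiler-generated object reuses itself as the enumerator.
        IEnumerator<int> first = enumerable.GetEnumerator();

        // Second call: the object is already in use, so a new instance is created.
        IEnumerator<int> second = enumerable.GetEnumerator();

        Console.WriteLine(ReferenceEquals(enumerable, first));  // expected: True
        Console.WriteLine(ReferenceEquals(enumerable, second)); // expected: False
    }
}
```

This is why a cached enumerable can end up holding the current element of its first enumeration: the cached object and the first enumerator are the same instance.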

"But I am not a fan of creating a new enumerable each time an enumerator is needed."

But that is actually what happens when you call _enumerable.GetEnumerator() (which is, of course, an implementation detail). If you check the decompilation mentioned above, you will see that _enumerable = Iterator() is actually new <Iterator>d__2(-2), and <Iterator>d__2.GetEnumerator() looks something like this:

IEnumerator<string> IEnumerable<string>.GetEnumerator()
{
    if (<>1__state == -2 && <>l__initialThreadId == Environment.CurrentManagedThreadId)
    {
        <>1__state = 0;
        return this;
    }
    return new <Iterator>d__2(0);
}

So it actually creates a new iterator instance for every enumeration except the first, and your public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator(); approach is just fine.
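
A variation on the same idea, sketched here as an assumption rather than something from the question: C# iterator blocks may also return IEnumerator<T> directly, so GetEnumerator() can itself be the iterator. No enumerable is cached in a field, and every enumeration gets its own state machine that becomes collectible once the caller drops it.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Sketch only: same public surface as the question's DnaGenerator,
// but without the cached _enumerable field.
class DnaGenerator : IEnumerable<string>
{
    // The iterator block returns IEnumerator<string>, so each call
    // produces an independent enumerator.
    public IEnumerator<string> GetEnumerator()
    {
        while (true)
            foreach (char c in new[] { 'A', 'C', 'G', 'T' })
                yield return new string(c, 10_000_000);
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

static class Program
{
    static void Main()
    {
        var dnaGenerator = new DnaGenerator();

        foreach (string dna in dnaGenerator.Take(5))
            Console.WriteLine($"{dna.Length} nucleotides, starting from {dna[0]}");

        // Measured after the enumeration has finished; the generator itself
        // holds no reference to any enumerator or sequence.
        Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
        GC.KeepAlive(dnaGenerator);
    }
}
```

The trade-off is the one already discussed: a small state machine object is allocated per enumeration, which is negligible next to the 20 MB strings it produces.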

Guru Stron answered Apr 27 '26 20:04


