Split a single-use large IEnumerable<T> in half using a condition

Let's say we have a Foo class:

public class Foo
{
    public DateTime Timestamp { get; set; }
    public double Value { get; set; }

    // some other properties

    public static Foo CreateFromXml(Stream str)
    {
        Foo f = new Foo();

        // do the parsing

        return f;
    }

    public static IEnumerable<Foo> GetAllTheFoos(DirectoryInfo dir)
    {
        foreach(FileInfo fi in dir.EnumerateFiles("foo*.xml", SearchOption.TopDirectoryOnly))
        {
            using(FileStream fs = fi.OpenRead())
                yield return Foo.CreateFromXml(fs);
        }
    }
}

To give some perspective: the data in these files has been recorded for about 2 years, usually at a rate of several Foos per minute.

Now, we have a parameter TimeSpan TrainingPeriod, which is, for example, about 15 days. What I'd like to accomplish is to call:

var allTheData = GetAllTheFoos(myDirectory);

and obtain two IEnumerable<Foo> sequences from it, TrainingSet and TestSet, where TrainingSet consists of the Foos from the first 15 days of recording and TestSet of all the rest. Then, from the TrainingSet we want to calculate some constant-memory data (such as the average Value, a linear regression, etc.) and then start consuming the TestSet using the calculated values. In other words, my code should be semantically equivalent to:

TimeSpan TrainingPeriod = new TimeSpan(15, 0, 0, 0); // 15 days (days, hours, minutes, seconds); the 3-argument form would mean 15 hours

var allTheData = GetAllTheFoos(myDirectory);
List<Foo> allTheDataList = allTheData.ToList();

var threshold = allTheDataList[0].Timestamp + TrainingPeriod;

List<Foo> TrainingSet = allTheDataList.Where(foo => foo.Timestamp < threshold).ToList();
List<Foo> TestSet = allTheDataList.Where(foo => foo.Timestamp >= threshold).ToList();

By the way, the XML file naming convention ensures that the Foos are returned in chronological order. Of course, I do not want to store everything in memory, which is what happens every time .ToList() is called. So I came up with another solution:

TimeSpan TrainingPeriod = new TimeSpan(15, 0, 0, 0); // days, hours, minutes, seconds

var allTheData = GetAllTheFoos(myDirectory);

var threshold = allTheData.First().Timestamp + TrainingPeriod; // a minor issue

var grouped = from foo in allTheData
              group foo by foo.Timestamp < threshold;

var TrainingSet = grouped.First(g => g.Key);
var TestSet = grouped.First(g => !g.Key); // the major one

However, this code has a minor and a major issue. The minor one is that the first file is read at least twice, which doesn't really matter. The major one is that TrainingSet and TestSet appear to access the directory independently: each reads every file and selects only the Foos satisfying its timestamp constraint. I'm not too puzzled by that; in fact, if it worked I would be puzzled and would have to rethink LINQ once again. But it raises file-access issues, and every file is parsed twice, which is a total waste of CPU time.
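
The double read described above can be reproduced with a small sketch. It assumes a lazy iterator standing in for GetAllTheFoos; the names GroupByReenumerationDemo, Source and CountPasses are hypothetical and exist only for illustration. Because the grouping query is deferred, each call to First starts a fresh pass over the source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class; not part of the original question.
public static class GroupByReenumerationDemo
{
    // Counts how many times enumeration of the source has been started.
    public static int Passes;

    // Stand-in for GetAllTheFoos: a lazy, ordered sequence.
    public static IEnumerable<int> Source()
    {
        Passes++; // runs each time a new enumeration begins
        for (int i = 0; i < 10; i++)
            yield return i;
    }

    public static int CountPasses()
    {
        Passes = 0;

        // Deferred query: nothing is read yet.
        var grouped = from x in Source()
                      group x by x < 3;

        // Each First(...) re-runs the query, so the source is read again.
        var training = grouped.First(g => g.Key).ToList();
        var test = grouped.First(g => !g.Key).ToList();

        return Passes;
    }
}
```

Calling GroupByReenumerationDemo.CountPasses() returns 2 here, one pass per First call. Grouping also buffers every element of a pass before the first group is handed out, so this approach does not avoid holding the data in memory either.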

So my question is: can I achieve this effect using only simple LINQ/C# tools? I think I can do this in a good ol' brute-force way, overriding some GetEnumerator(), MoveNext() methods and so on - please don't bother typing it, I can totally handle this on my own.

However, if there is some elegant, short&sweet solution to this, it would be highly appreciated.

Thank you!

Another edit:

The code I finally came up with is the following:

public static void Handle(DirectoryInfo dir)
{
    var allTheData = Foo.GetAllTheFoos(dir);

    var it = allTheData.GetEnumerator();

    it.MoveNext();

    TimeSpan trainingRange = new TimeSpan(15, 0, 0, 0);

    DateTime threshold = it.Current.Timestamp + trainingRange;

    double sum = 0.0;
    int count = 0;

    while(it.Current.Timestamp <= threshold)
    {
        sum += it.Current.Value;
        count++;

        it.MoveNext();
    }

    double avg = sum / (double)count;

    // now I can continue on with the 'it' IEnumerator
}

Of course, some minor issues remain, e.g. verifying the return value of MoveNext() (has the IEnumerable already ended?), but I hope the general idea is clear. BUT in the real code I'm calculating not just an average but different kinds of regression etc. So I'd like to somehow extract the first part and pass it as an IEnumerable to a class deriving from my

public abstract class AbstractAverageCounter
{
    public abstract void Accept(IEnumerable<Foo> theData);
    public AverageCounterResult Result { get; protected set; }
}

to separate the responsibilities of extracting the training data and processing it. Plus, after the process shown above I am left with an IEnumerator<Foo>, but I think an IEnumerable<Foo> would be preferable for passing to my TheRestOfTheDataHandler instance.

Wojciech Kozaczewski asked Feb 16 '15 13:02


1 Answer

You can try to implement a stateful iterator pattern over the IEnumerator obtained from the initial IEnumerable.

IEnumerable<T> StatefulTake<T>(IEnumerator<T> source, Func<bool> getDone, Action setDone);

This method just checks done, calls MoveNext, yields Current, and updates done if MoveNext returned false.
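
The answer does not show the method body, so the following is only a sketch of one possible implementation. It assumes the enumerator has already been primed with an initial MoveNext call, as the update note requires, and it yields Current before advancing; that order is an assumption. The helper SplitSums and the class name StatefulEnumeration are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container class; only the StatefulTake signature comes from the answer.
public static class StatefulEnumeration
{
    // Assumes source.MoveNext() was already called once by the caller.
    public static IEnumerable<T> StatefulTake<T>(
        IEnumerator<T> source, Func<bool> getDone, Action setDone)
    {
        while (!getDone())
        {
            // Yield first: if the consumer stops here (e.g. TakeWhile rejects
            // the element), the enumerator has not advanced past it.
            yield return source.Current;

            if (!source.MoveNext())
                setDone();
        }
    }

    // Sums the elements below the threshold, then the rest, in a single pass.
    public static (int trainingSum, int testSum) SplitSums(
        IEnumerable<int> data, int threshold)
    {
        using (var e = data.GetEnumerator())
        {
            bool done = !e.MoveNext();

            int trainingSum = StatefulTake(e, () => done, () => done = true)
                .TakeWhile(x => x < threshold)
                .Sum();

            int testSum = StatefulTake(e, () => done, () => done = true).Sum();

            return (trainingSum, testSum);
        }
    }
}
```

With this ordering, the element that ends the TakeWhile segment is still the enumerator's Current, so the next segment starts with it and nothing is dropped at the boundary. The flip side is that a segment abandoned with a plain break would replay its last element in the next segment.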

Then you split your set with successive calls to this method, partially enumerating each result with methods such as TakeWhile, Any, or First. You can then build any operations on top of that, but each of those must be enumerated to the end.

var source = GetThemAll();
using (var e = source.GetEnumerator()){
 // Prime the enumerator; done is true right away if the source is empty.
 bool done=!e.MoveNext();
 foreach(var i in StatefulTake(e, ()=>done,()=>done=true).TakeWhile(x=>x.Time<...)){
  //...
 }

 var theRestAverage = StatefulTake(e,()=>done,()=>done=true).Average(x=>x.Score);
 //...
}

It's a pattern I use often in my async toolkit.

Update: fixed the signature of the StatefulTake method; it cannot use a ref parameter. Also, the initial call to MoveNext is necessary. The three kinds of references to the done variable and the method itself should be encapsulated in a context class.
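
The context class mentioned in the update is not shown in the answer. A minimal sketch of what it might look like follows; the names SegmentedSource, Remaining and SegmentedSourceDemo are hypothetical, and the yield-then-advance order is the same assumption as in a hand-written StatefulTake.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper: owns the single-use enumerator and the done flag.
public sealed class SegmentedSource<T> : IDisposable
{
    private readonly IEnumerator<T> _enumerator;
    private bool _done;

    public SegmentedSource(IEnumerable<T> source)
    {
        _enumerator = source.GetEnumerator();
        _done = !_enumerator.MoveNext(); // the initial MoveNext call
    }

    public bool IsDone => _done;

    // Each call continues from wherever the previous segment stopped.
    public IEnumerable<T> Remaining()
    {
        while (!_done)
        {
            yield return _enumerator.Current;
            _done = !_enumerator.MoveNext();
        }
    }

    public void Dispose() => _enumerator.Dispose();
}

public static class SegmentedSourceDemo
{
    // Averages the values below the threshold, then counts the rest.
    // Average throws if the first segment is empty.
    public static (double trainingAverage, int testCount) Run(
        IEnumerable<int> values, int threshold)
    {
        using (var source = new SegmentedSource<int>(values))
        {
            double trainingAverage = source.Remaining()
                .TakeWhile(v => v < threshold)
                .Average();

            int testCount = source.Remaining().Count();

            return (trainingAverage, testCount);
        }
    }
}
```

Since Remaining() returns an IEnumerable<T>, the first segment could be handed to something like the question's Accept(IEnumerable<Foo>) method and the second to the handler for the rest of the data, with the underlying files still read only once.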

George Polevoy answered Oct 06 '22 01:10