 

Aggregation function - f# vs c# performance

Tags:

f#

I have a function that I use a lot, so its performance needs to be as good as possible. It takes data from Excel and then sums, averages or counts over parts of the data, depending on whether the data falls within a certain period and whether it is a peak hour (Mo-Fr 8-20).

The data is usually around 30,000 rows and 2 columns (hourly date, value). One important feature of the data is that the date column is chronologically ordered.

I have three implementations. The first is C# with extension methods (dead slow, and I'm not going to show it unless somebody is interested).

Then I have this F# implementation:

let ispeak dts =
    let newdts = DateTime.FromOADate dts
    match newdts.DayOfWeek, newdts.Hour with
    | DayOfWeek.Saturday, _ | DayOfWeek.Sunday, _ -> false
    | _, h when h >= 8 && h < 20 -> true
    | _ -> false

let internal isbetween a std edd =
    match a with
    | r when r >= std && r < edd+1. -> true
    | _ -> false

[<ExcelFunction(Name="aggrF")>]
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let newd =
        [0 .. (Array.length data) - 1]
        |> List.map (fun i -> (data.[i], data2.[i])) 
        |> Seq.filter (fun (date, _) -> 
            let dateInRange = isbetween date std edd
            match pob with
            | "Peak" -> ispeak date && dateInRange
            | "Offpeak" -> not(ispeak date) && dateInRange
            | _ -> dateInRange)
    match sac with
    | 0 -> newd |> Seq.averageBy (fun (_, value) -> value)
    | 2 -> newd |> Seq.sumBy (fun (_, value) -> 1.0)
    | _ -> newd |> Seq.sumBy (fun (_, value) -> value)

I see two issues with this:

  1. I need to prepare the data because both date and value are double[]
  2. I do not utilize the knowledge that the dates are chronological, hence I do unnecessary iterations (see the sketch below).
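
One way to exploit that ordering - a sketch only, assuming the dates are strictly ascending; lowerBound and aggrSliced are illustrative names, not part of any of the implementations shown here - is to binary-search the slice boundaries first, so that only the rows inside [std, edd+1) are ever touched:

// Locate the first index whose date is >= bound in an ascending array.
// System.Array.BinarySearch returns the bitwise complement of the
// insertion point when the value is not present.
let lowerBound (dates: float[]) bound =
    let i = System.Array.BinarySearch(dates, bound)
    if i >= 0 then i else ~~~i

// Aggregate only over the slice [std, edd+1.) instead of scanning every row.
let aggrSliced (dates: float[]) (values: float[]) std edd pob sac =
    let lo = lowerBound dates std
    let hi = lowerBound dates (edd + 1.)   // exclusive upper bound
    let mutable sum = 0.0
    let mutable count = 0
    for i in lo .. hi - 1 do
        let keep =
            match pob with
            | "Peak"    -> ispeak dates.[i]
            | "Offpeak" -> not (ispeak dates.[i])
            | _         -> true
        if keep then
            sum <- sum + values.[i]
            count <- count + 1
    match sac with
    | 0 -> sum / float count   // average
    | 2 -> float count         // count
    | _ -> sum                 // sum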

Here is what I would call a brute-force imperative C# version:

    public static bool ispeak(double dats)
    {
        var dts = System.DateTime.FromOADate(dats);
        if (dts.DayOfWeek != DayOfWeek.Sunday & dts.DayOfWeek != DayOfWeek.Saturday & dts.Hour > 7 & dts.Hour < 20)
            return true;
        else
            return false;
    }

    [ExcelFunction(Description = "Aggregates HFC/EG into average or sum over period, start date inclusive, end date exclusive")]
    public static double aggrI(double[] dts, double[] vals, double std, double edd, string pob, double sumavg)
    {
        double accsum = 0;
        int acccounter = 0;
        int indicator = 0;
        bool peakbool = pob.Equals("Peak", StringComparison.OrdinalIgnoreCase);
        bool offpeakbool = pob.Equals("Offpeak", StringComparison.OrdinalIgnoreCase);
        bool basebool = pob.Equals("Base", StringComparison.OrdinalIgnoreCase);


        for (int i = 0; i < vals.Length; ++i)
        {
            if (dts[i] >= std && dts[i] < edd + 1)
            {
                indicator = 1;
                if (peakbool && ispeak(dts[i]))
                {
                    accsum += vals[i];
                    ++acccounter;
                }
                else if (offpeakbool && (!ispeak(dts[i])))
                {
                    accsum += vals[i];
                    ++acccounter;
                }
                else if (basebool)
                {
                    accsum += vals[i];
                    ++acccounter;
                }
            }
            else if (indicator == 1)
            {
                break;
            }
        }

        if (sumavg == 0)
        {
            return accsum / acccounter;
        }
        else if (sumavg == 2)
        {
            return acccounter;
        }
        else
        {
            return accsum;
        }
    }

This is much faster (I'm guessing mainly because of the early exit from the loop once the period has ended) but obviously less succinct.

My questions:

  1. Is there a way to stop iterations in the F# Seq module for sorted series? (See the sketch after this list for the kind of thing I mean.)

  2. Is there another way to speed up the f# version?

  3. Can somebody think of an even better way of doing this? Thanks a lot!
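
For concreteness, one shape such an early exit could take (a sketch using Seq.skipWhile and Seq.takeWhile, not code from the answers below; it still walks the rows before the window, but it stops pulling elements as soon as the sorted dates pass the end of the window, much like the C# break):

let aggrLazy (dates: float[]) (values: float[]) std edd pob sac =
    let filtered =
        Seq.zip dates values
        |> Seq.skipWhile (fun (d, _) -> d < std)       // skip rows before the window
        |> Seq.takeWhile (fun (d, _) -> d < edd + 1.)  // stop at the first row past it
        |> Seq.filter (fun (d, _) ->
            match pob with
            | "Peak"    -> ispeak d
            | "Offpeak" -> not (ispeak d)
            | _         -> true)
    match sac with
    | 0 -> filtered |> Seq.averageBy snd
    | 2 -> filtered |> Seq.length |> float
    | _ -> filtered |> Seq.sumBy snd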

Update: Speed comparison

I set up a test array with hourly dates from 1/1/13-31/12/15 (roughly 30,000 rows) and corresponding values. I made 150 calls spread out over the date array and repeated this 100 times - 15,000 function calls in total.
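
A minimal sketch of what such a harness could look like (the data generation and call pattern here are assumptions for illustration, not the exact code behind the numbers below):

open System
open System.Diagnostics

// Hourly OADates from 1 Jan 2013 to 31 Dec 2015, plus dummy values.
let start = DateTime(2013, 1, 1)
let hours = int ((DateTime(2015, 12, 31) - start).TotalHours)
let dates  = Array.init hours (fun i -> (start.AddHours(float i)).ToOADate())
let values = Array.init hours (fun i -> float (i % 24))

let sw = Stopwatch.StartNew()
for _ in 1 .. 100 do
    for call in 0 .. 149 do
        // 150 start dates spread across the array, each aggregating roughly a month;
        // aggrF stands for whichever implementation is being timed
        let std = dates.[call * (hours / 150)]
        aggrF dates values std (std + 30.) "Peak" 0 |> ignore
sw.Stop()
printfn "%d calls took %A" (100 * 150) sw.Elapsed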

The results:

My C# implementation above (with the string comparison outside the loop): 1.36 secs

Matthew's recursion F#: 1.55 secs

Tomas's array F#: 1m40secs

My original F#: 2m20secs

Obviously this is specific to my machine, but it gives an idea, and people asked for it...

I also think one should keep in mind that this doesn't mean recursion or for loops are always faster than Array.map etc.; it's just that in this case the original does a lot of unnecessary iterations, because it doesn't have the early exit that the C# version and the F# recursion method have.

asked Dec 15 '22 by nik

2 Answers

Using Array instead of List and Seq makes this about 3-4 times faster. You do not need to generate a list of indices and then map over it to look up items in the two arrays - instead you can use Array.zip to combine the two arrays into a single one and then use Array.filter.

In general, if you want performance, then using an array as your data structure makes sense (unless you have a long pipeline of things). Functions like Array.zip and Array.map can calculate the entire array size, allocate it and then do an efficient imperative operation (while still looking functional from the outside).

let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let newd =
        Array.zip data data2 
        |> Array.filter (fun (date, _) -> 
            let dateInRange = isbetween date std edd
            match pob with
            | "Peak" -> ispeak date && dateInRange
            | "Offpeak" -> not(ispeak date) && dateInRange
            | _ -> dateInRange)
    match sac with 
    | 0 -> newd |> Array.averageBy (fun (_, value) -> value)
    | 2 -> newd |> Array.sumBy (fun (_, value) -> 1.0)
    | _ -> newd |> Array.sumBy (fun (_, value) -> value)

I also changed isbetween - it can be simplified into just an expression and you can mark it inline, but that does not add that much:

let inline isbetween r std edd = r >= std && r < edd+1.

Just for completeness, I tested this with the following code (using F# Interactive):

#time 
let d1 = Array.init 1000000 float
let d2 = Array.init 1000000 float
aggrF d1 d2 0.0 1000000.0 "Test" 0

The original version took about 600ms and the new version using arrays takes between 160ms and 200ms. The version by Matthew takes about 520ms.

As an aside, I spent the last two months at BlueMountain Capital working on a time series/data frame library for F# that would make this a lot simpler. It is a work in progress and the name of the library will change, but you can find it on the BlueMountain GitHub. The code would look something like this (it uses the fact that the time series is ordered and uses slicing to get the relevant part before filtering):

let ts = Series(times, values)
ts.[std .. edd] |> Series.filter (fun k _ -> not (ispeak k)) |> Series.mean

Currently, this will not be as fast as direct array operations, but I'll look into that :-).

answered Dec 29 '22 by Tomas Petricek

An immediate way to speed it up would be to combine these:

[0 .. (Array.length data) - 1]
    |> List.map (fun i -> (data.[i], data2.[i])) 
    |> Seq.filter (fun (date, _) -> 

into a single list comprehension and, as the other Matthew said, do a single string comparison:

let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let isValidTime = match pob with
                        | "Peak" -> (fun x -> ispeak x)
                        | "Offpeak" -> (fun x -> not(ispeak x))
                        | _ -> (fun _ -> true)

    let data = [ for i in 0 .. (Array.length data) - 1 do 
                  let (date, value) = (data.[i], data2.[i])
                  if isbetween date std edd && isValidTime date then
                      yield (date, value)
                  else
                      () ]

    match sac with 
    | 0 -> data |> Seq.averageBy (fun (_, value) -> value)
    | 2 -> float data.Length
    | _ -> data |> Seq.sumBy (fun (_, value) -> value)

Or use a tail recursive function:

let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let isValidTime = match pob with
                        | "Peak" -> (fun x -> ispeak x)
                        | "Offpeak" -> (fun x -> not(ispeak x))
                        | _ -> (fun _ -> true)

    let endDate = edd + 1.0

    let rec aggr i sum count =
        if i >= (Array.length data) || data.[i] >= endDate then
            match sac with 
            | 0 -> sum / float(count)
            | 2 -> float(count)
            | _ -> float(sum)
        else if data.[i] >= std && isValidTime data.[i] then
            aggr (i + 1) (sum + data2.[i]) (count + 1)
        else
            aggr (i + 1) sum count

    aggr 0 0.0 0
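
For completeness, a quick illustrative call (the arrays here are made up just to show the argument shapes - a week of hourly OADates with dummy values, then the peak-hour average over the first three days):

let t0     = System.DateTime(2014, 1, 1)
let dates  = Array.init (7 * 24) (fun i -> (t0.AddHours(float i)).ToOADate())
let values = Array.init (7 * 24) (fun i -> float i)

// sac = 0 -> average of the peak-hour values in [t0, t0 + 3 days)
let peakAvg = aggrF dates values (t0.ToOADate()) (t0.AddDays(2.).ToOADate()) "Peak" 0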
answered Dec 29 '22 by Matthew Mcveigh