I have a function that I use a lot, so its performance needs to be as good as possible. It takes data from Excel and then sums, averages or counts over parts of the data, based on whether each point is within a certain period and whether it falls in a peak hour (Mon-Fri, 8-20).
The data is usually around 30,000 rows and 2 columns (hourly date, value). One important feature of the data is that the date column is chronologically ordered.
I have three implementations. The first is C# with extension methods (dead slow, and I'm not going to show it unless somebody is interested).
Then I have this F# implementation:
let ispeak dts =
    let newdts = DateTime.FromOADate dts
    match newdts.DayOfWeek, newdts.Hour with
    | DayOfWeek.Saturday, _ | DayOfWeek.Sunday, _ -> false
    | _, h when h >= 8 && h < 20 -> true
    | _ -> false

let internal isbetween a std edd =
    match a with
    | r when r >= std && r < edd+1. -> true
    | _ -> false

[<ExcelFunction(Name="aggrF")>]
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let newd =
        [0 .. (Array.length data) - 1]
        |> List.map (fun i -> (data.[i], data2.[i]))
        |> Seq.filter (fun (date, _) ->
            let dateInRange = isbetween date std edd
            match pob with
            | "Peak" -> ispeak date && dateInRange
            | "Offpeak" -> not(ispeak date) && dateInRange
            | _ -> dateInRange)
    match sac with
    | 0 -> newd |> Seq.averageBy (fun (_, value) -> value)
    | 2 -> newd |> Seq.sumBy (fun (_, value) -> 1.0)
    | _ -> newd |> Seq.sumBy (fun (_, value) -> value)
I see two issues with this: the pob string match is evaluated again for every single element, and the whole array is traversed even though the dates are ordered, so there is no early exit once the period has passed.
Now here comes what I would call a brute-force imperative C# version:
public static bool ispeak(double dats)
{
    var dts = System.DateTime.FromOADate(dats);
    return dts.DayOfWeek != DayOfWeek.Sunday && dts.DayOfWeek != DayOfWeek.Saturday
        && dts.Hour > 7 && dts.Hour < 20;
}
[ExcelFunction(Description = "Aggregates HFC/EG into average or sum over period, start date inclusive, end date exclusive")]
public static double aggrI(double[] dts, double[] vals, double std, double edd, string pob, double sumavg)
{
    double accsum = 0;
    int acccounter = 0;
    int indicator = 0;
    bool peakbool = pob.Equals("Peak", StringComparison.OrdinalIgnoreCase);
    bool offpeakbool = pob.Equals("Offpeak", StringComparison.OrdinalIgnoreCase);
    bool basebool = pob.Equals("Base", StringComparison.OrdinalIgnoreCase);
    for (int i = 0; i < vals.Length; ++i)
    {
        if (dts[i] >= std && dts[i] < edd + 1)
        {
            indicator = 1; // we have entered the period
            if (peakbool && ispeak(dts[i]))
            {
                accsum += vals[i];
                ++acccounter;
            }
            else if (offpeakbool && (!ispeak(dts[i])))
            {
                accsum += vals[i];
                ++acccounter;
            }
            else if (basebool)
            {
                accsum += vals[i];
                ++acccounter;
            }
        }
        else if (indicator == 1)
        {
            // the dates are sorted, so once we have left the period we can stop
            break;
        }
    }
    if (sumavg == 0)
    {
        return accsum / acccounter;
    }
    else if (sumavg == 2)
    {
        return acccounter;
    }
    else
    {
        return accsum;
    }
}
This is much faster (I'm guessing mainly because of the early exit from the loop once the period has ended) but obviously less succinct.
My questions:
Is there a way to stop iterations in the F# Seq module for sorted series?
Is there another way to speed up the F# version?
Can somebody think of an even better way of doing this? Thanks a lot!
Update: Speed comparison
I set up a test array with hourly dates from 1/1/13 to 31/12/15 (roughly 30,000 rows) and corresponding values. I made 150 calls spread out over the date array and repeated this 100 times, i.e. 15,000 function calls (a sketch of the harness follows the results below):

My C# implementation above (with the string comparison outside of the loop): 1.36 secs
Matthew's recursive F#: 1.55 secs
Tomas's array F#: 1 min 40 secs
My original F#: 2 min 20 secs

Obviously this is always specific to my machine, but it gives an idea, and people asked for it...
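For reproducibility, here is a hypothetical reconstruction of that harness; the one-week window per call and the dummy value generator are my assumptions, the rest follows the description above:

open System
open System.Diagnostics

let start = DateTime(2013, 1, 1)
let hours = 365 * 3 * 24                        // hourly stamps for 2013-2015
let dates = Array.init hours (fun i -> start.AddHours(float i).ToOADate())
let values = Array.init hours (fun i -> float (i % 24))   // assumption: dummy values

let sw = Stopwatch.StartNew()
for _ in 1 .. 100 do                            // repeat 100 times
    for c in 0 .. 149 do                        // 150 calls spread over the date array
        let std = dates.[c * (hours / 150)]
        let edd = std + 7.0                     // assumption: one-week window per call
        aggrF dates values std edd "Peak" 0 |> ignore
printfn "%A" sw.Elapsed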
I also think one should keep in mind that this doesn't mean recursion or for loops are always faster than Array.map etc.; in this case the original simply does a lot of unnecessary iterations, because it lacks the early exit that the C# version and the recursive F# version have.
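That said, the Seq module itself can deliver an early exit on sorted data, because Seq.skipWhile and Seq.takeWhile are lazy: enumeration stops at the first date past the period. A minimal sketch (the inPeriod name is mine, just for illustration):

// Lazy early exit over chronologically ordered dates.
// Seq.takeWhile stops pulling elements at the first date past the period,
// so the tail of the 30,000-row array is never enumerated.
let inPeriod (data: float[]) (data2: float[]) std edd =
    Seq.zip data data2
    |> Seq.skipWhile (fun (date, _) -> date < std)       // skip dates before the period
    |> Seq.takeWhile (fun (date, _) -> date < edd + 1.)  // stop once the period has passed

Note that Seq.skipWhile still enumerates the skipped prefix one element at a time, so this avoids scanning past the end of the period but not the scan up to its start; the recursive version further down has the same property.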
Using Array instead of List and Seq makes this about 3-4 times faster. You do not need to generate a list of indices and then map over that to look up items in the two arrays - instead you can use Array.zip to combine the two arrays into a single one and then use Array.filter.
In general, if you want performance, then using an array as your data structure will make sense (unless you have a long pipeline of things). Functions like Array.zip and Array.map can calculate the entire array size, allocate it and then do an efficient imperative operation (while still looking functional from the outside).
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let newd =
        Array.zip data data2
        |> Array.filter (fun (date, _) ->
            let dateInRange = isbetween date std edd
            match pob with
            | "Peak" -> ispeak date && dateInRange
            | "Offpeak" -> not(ispeak date) && dateInRange
            | _ -> dateInRange)
    match sac with
    | 0 -> newd |> Array.averageBy (fun (_, value) -> value)
    | 2 -> newd |> Array.sumBy (fun (_, value) -> 1.0)
    | _ -> newd |> Array.sumBy (fun (_, value) -> value)
I also changed isbetween - it can be simplified into just an expression and you can mark it inline, but that does not add that much:
let inline isbetween r std edd = r >= std && r < edd+1.
Just for completeness, I tested this with the following code (using F# Interactive):
#time
let d1 = Array.init 1000000 float
let d2 = Array.init 1000000 float
aggrF d1 d2 0.0 1000000.0 "Test" 0
The original version took about 600ms and the new version using arrays takes between 160ms and 200ms. The version by Matthew takes about 520ms.
As an aside, I spent the last two months at BlueMountain Capital working on a time series/data frame library for F# that would make this a lot simpler. It is a work in progress and the name of the library will change, but you can find it on the BlueMountain GitHub. The code would look something like this (it uses the fact that the time series is ordered and uses slicing to get the relevant part before filtering):
let ts = Series(times, values)
ts.[std .. edd] |> Series.filter (fun k _ -> not (ispeak k)) |> Series.mean
Currently, this will not be as fast as direct array operations, but I'll look into that :-).
An immediate way to speed it up would be to combine these:
[0 .. (Array.length data) - 1]
|> List.map (fun i -> (data.[i], data2.[i]))
|> Seq.filter (fun (date, _) ->
into a single list comprehension, and also, as the other Matthew said, do a single string comparison:
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let isValidTime =
        match pob with
        | "Peak" -> (fun x -> ispeak x)
        | "Offpeak" -> (fun x -> not(ispeak x))
        | _ -> (fun _ -> true)
    let data = [ for i in 0 .. (Array.length data) - 1 do
                     let (date, value) = (data.[i], data2.[i])
                     if isbetween date std edd && isValidTime date then
                         yield (date, value) ]
    match sac with
    | 0 -> data |> Seq.averageBy (fun (_, value) -> value)
    | 2 -> float data.Length
    | _ -> data |> Seq.sumBy (fun (_, value) -> value)
Or use a tail recursive function:
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let isValidTime =
        match pob with
        | "Peak" -> (fun x -> ispeak x)
        | "Offpeak" -> (fun x -> not(ispeak x))
        | _ -> (fun _ -> true)
    let endDate = edd + 1.0
    let rec aggr i sum count =
        if i >= (Array.length data) || data.[i] >= endDate then
            match sac with
            | 0 -> sum / float(count)
            | 2 -> float(count)
            | _ -> sum
        else if data.[i] >= std && isValidTime data.[i] then
            aggr (i + 1) (sum + data2.[i]) (count + 1)
        else
            aggr (i + 1) sum count
    aggr 0 0.0 0
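For a quick sanity check, here is a hypothetical call, assuming OADate serials as in the question; the dates and values arrays stand in for the data coming from Excel:

// Hypothetical usage: average of the Peak values over the first week of 2013.
let std = System.DateTime(2013, 1, 1).ToOADate()
let edd = System.DateTime(2013, 1, 7).ToOADate()
let peakAverage = aggrF dates values std edd "Peak" 0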