Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I reshape my data before I turn it into a histogram?

Imagine that I have a visits index that contains documents of type 'visit' that look like the following:

{
               "id": "c223a991-b4e7-4333-ba45-a576010b568b",
// other properties
               "buildingId": "48da1a81-fa73-4d4f-aa22-a5750162ed1e",
               "arrivalDateTimeUtc": "2015-12-22T21:15:00Z"
}

The following function will return a histogram which returns buckets of visits for each day in the given range according to the given time zone.

    public Bucket<HistogramItem> Execute(MyParameterType parameters)
    {
        var buildingFilter = Filter<VisitProjection>.Term(x => x.BuildingId, parameters.BuildingId);
        var dateFilter = Filter<VisitProjection>.Range(r => r
            .OnField(p => p.ArrivalDateTimeUtc)
            .GreaterOrEquals(parameters.EarliestArrivalDateTimeUtc)
            .LowerOrEquals(parameters.LatestArrivalDateTimeUtc)
        );

        var result = _elasticClient.Search<VisitProjection>(s => s
            .Index("visits")
            .Type("visit")
            .Aggregations(a => a
                .Filter("my_filter_agg", f => f
                    .Filter(fd => buildingFilter && dateFilter)
                        .Aggregations(ta => ta.DateHistogram("my_date_histogram", h => h
                            .Field(p => p.ArrivalDateTimeUtc)
                            .Interval(parameters.DateInterval) // "day"
                            .TimeZone(NodaTimeHelpers.WindowsToIana(parameters.TimeZoneInfo)) // This is a critical piece of the equation.
                            .MinimumDocumentCount(0)
                        )
                    )
                )
            )
        );

        return result.Aggs.Nested("my_filter_agg").DateHistogram("my_date_histogram");
    }
}

// Returns [{Date: 12/22/2015 12:00:00 AM, DocCount: 1}]

Now imagine that I changed things up a bit. Imagine that I added a new field to the document:

{
               "id": "c223a991-b4e7-4333-ba45-a576010b568b",
// other properties
               "buildingId": "48da1a81-fa73-4d4f-aa22-a5750162ed1e",
               "arrivalDateTimeUtc": "2015-12-22T21:15:00Z",
               "departureDateTimeUtc": "2015-12-23T22:00:00Z" // new property
}

And assume that I want to return the following:

// Returns [{Date: 12/22/2015 12:00:00 AM, DocCount: 1}, {Date: 12/23/2015 12:00:00 AM, DocCount: 1}]

because the visit spanned two days and I want a date histogram which records one unit for each day that a visit spans.

How would I do this with NEST/Elastic Search?


Note 1: Unless someone convinces me otherwise, I don't think would be a good idea to collect all documents in the range and perform the aggregation/bucketization and date histogram in the middle-tier (or C# layer).

Note 2: The time zone aspect of this problem is critical because I need the counts to be bucketized according to the given time zone.

like image 871
Jim G. Avatar asked Mar 17 '16 18:03

Jim G.


2 Answers

One way could be to use the scripted_metric aggregation and perform the bucketing yourself based on your two date fields. Pretty convoluted and not really performant depending on how many docs you have.

However, another easier solution could be to use a single date field and put all dates of the interval into an array (arrival first, departure last and all other dates in between), like this:

{
     "id": "c223a991-b4e7-4333-ba45-a576010b568b",
     "buildingId": "48da1a81-fa73-4d4f-aa22-a5750162ed1e",
     "visitDateTimeUtc": ["2015-12-22T21:15:00Z", "2015-12-23T22:00:00Z" ]
}

And if a visit spans three/four/etc days, you can just "fill" the array with the days between arrival and departure

{
     "id": "c223a991-b4e7-4333-ba45-a576010b568b",
     "buildingId": "48da1a81-fa73-4d4f-aa22-a5750162ed1e",
     "visitDateTimeUtc": ["2015-12-22T21:15:00Z", "2015-12-23T22:00:00Z", "2015-12-24T22:00:00Z", "2015-12-25T22:00:00Z" ]
}

By doing so, all dates of the interval will be taken into account by your date_histogram aggregation.

like image 67
Val Avatar answered Sep 19 '22 16:09

Val


I would consider solving this problem by having a new array property on your ES model for VisitDays, so if someone stayed from 01-jan-2015 until 05-jan-2015 then your model would be something like this:

{
    "id" : "c223a991-b4e7-4333-ba45-a576010b568b",
    // other properties
    "buildingId" : "48da1a81-fa73-4d4f-aa22-a5750162ed1e",
    "arrivalDateTimeUtc" : "2015-01-01T21:15:00Z",
    "departureDateTimeUtc" : "2015-01-05T22:00:00Z", // new property
    "visitDays" : [
        "2015-01-01",
        "2015-01-02",
        "2015-01-03",
        "2015-01-04",
        "2015-01-05"
    ]
}

If you did that, then your bucketing would be very easy and very fast. A scripted field would not be fast. And you're 100% correct that trying to do this in C# would not work because it would be way too slow.

like image 31
jhilden Avatar answered Sep 21 '22 16:09

jhilden