
Fill in missing data from Queryset Django

I've inherited an AngularJS / Django application using Django REST Framework and a Postgres DB, which is being re-platformed from AngularJS to React / Redux. One of the things we're trying to do is present time series data using amCharts4. A problem (among many others) we're having is presenting data over a time range for which there may be no entries in the DB. For example, we have results which might look something like:

[
    {
        "date": "2020-01-16T00:00:00.000Z",
        "result": 3
    },
    {
        "date": "2020-01-18T00:00:00.000Z",
        "result": 2
    }
]

And would like them to look something like:

[
    {
        "date": "2020-01-16T00:00:00.000Z",
        "result": 3
    },
    {
        "date": "2020-01-17T00:00:00.000Z",
        "result": 0
    },
    {
        "date": "2020-01-18T00:00:00.000Z",
        "result": 2
    }
]

Additionally, we also have data with multiple data points per time event:

[
    {
        "date": "2020-01-13T00:00:00Z",
        "result": 1,
        "name": "Yes"
    },
    {
        "date": "2020-01-14T00:00:00Z",
        "result": 1,
        "name": "No"
    },
    {
        "date": "2020-01-16T00:00:00Z",
        "result": 1,
        "name": "No"
    }
]

And would like the data filled in with 0's for any name on any date where there isn't a result:

[
    {
        "date": "2020-01-13T00:00:00Z",
        "result": 1,
        "name": "Yes"
    },
    {
        "date": "2020-01-13T00:00:00Z",
        "result": 0,
        "name": "No"
    },
    {
        "date": "2020-01-14T00:00:00Z",
        "result": 0,
        "name": "Yes"
    },
    {
        "date": "2020-01-14T00:00:00Z",
        "result": 1,
        "name": "No"
    },
    {
        "date": "2020-01-15T00:00:00Z",
        "result": 0,
        "name": "Yes"
    },
    {
        "date": "2020-01-15T00:00:00Z",
        "result": 0,
        "name": "No"
    },
    {
        "date": "2020-01-16T00:00:00Z",
        "result": 0,
        "name": "Yes"
    },
    {
        "date": "2020-01-16T00:00:00Z",
        "result": 1,
        "name": "No"
    }
]

The range of these results is also not necessarily governed by the start and end dates in the data but may be specified by the user. In this case, we would need to fill in zero-value results for all options for all dates in that range.

I'm aware of amCharts' skipEmptyPeriods property, but my frontend engineers have told me this won't work for the case of multiple trend lines (i.e., the second case, where there are multiple options per date). Also, this isn't really a frontend problem, and handling it there will cause performance issues down the line.

Additionally, I've tried using PostgreSQL's generate_series function with coalesce, but wasn't able to get this to work for the second case.

Currently I'm trying this in Pandas (which I've never used) and have a solution to the first problem of single entries per date but, again, am having trouble with the second case of multiple entries per date:

    from_date = request.query_params.get("from_date")
    to_date = request.query_params.get("to_date")

    # let's do some zero plotting
    filtered_queryset = list(filtered_queryset)
    if from_date:
        from_date = datetime.strptime(from_date, "%Y-%m-%d").astimezone(pytz.UTC)
    else:
        from_date = filtered_queryset[0]["date"]
    if to_date:
        to_date = datetime.strptime(to_date, "%Y-%m-%d").astimezone(pytz.UTC)
        _now = localtime(now()).astimezone(pytz.UTC)
        to_date = min(to_date, _now)
    else:
        to_date = localtime(now()).astimezone(pytz.UTC)

    pandas_freq_map = {"day": "D", "week": "W-MON", "month": "MS"}
    freq = pandas_freq_map.get(request.query_params.get("frequency"))

    idx = pd.date_range(from_date.date(), to_date.date(), freq=freq)
    df = pd.DataFrame(list(filtered_queryset))
    datetime_series = pd.to_datetime(df["date"])
    datetime_index = pd.DatetimeIndex(datetime_series.values)

    df = df.set_index(datetime_index)
    df.drop("date", axis=1, inplace=True)
    df = df.asfreq(freq)
    df = df.reindex(idx, fill_value=0)
    df_json = json.JSONDecoder().decode(df.to_json(date_format="iso"))

    # this (result or 0) tomfoolery is bc I don't understand why pandas sometimes reindexes with null as the fill_value
    prepared_response = [{"date": date, "result": (result or 0)} for date, result in df_json["result"].items()]

josh asked May 20 '26 11:05

1 Answer

Below is an attempt at a solution with pandas. Basically, you can resample and then reindex with the date range, but this becomes a bit clunky with the composite (name, date) index.
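
For the single-series case from the question, the reindex idea might look like the sketch below. It also shows a likely reason why `fill_value=0` seemed to be ignored in the question's code: `asfreq` creates NaN rows inside the data range first, and a later `reindex(..., fill_value=0)` only fills labels that are new to the index. The date range here is a made-up user-specified range, and the variable names are only for illustration.

```python
import pandas as pd

# Sample rows from the question's first (single-series) case.
rows = [
    {"date": "2020-01-16T00:00:00.000Z", "result": 3},
    {"date": "2020-01-18T00:00:00.000Z", "result": 2},
]

df = pd.DataFrame(rows)
df["date"] = pd.to_datetime(df["date"], utc=True)
series = df.set_index("date")["result"]

# Hypothetical user-specified range, wider than the data itself.
idx = pd.date_range("2020-01-15", "2020-01-19", freq="D", tz="UTC")

# asfreq() inserts a NaN row for 2020-01-17, which lies inside the data range.
gaps = series.asfreq("D")

# reindex(fill_value=0) only fills labels missing from the index it is given,
# so the NaN that asfreq() already created for 2020-01-17 survives.
still_nan = gaps.reindex(idx, fill_value=0)

# Reindexing the original series directly fills every missing day with 0.
filled = series.reindex(idx, fill_value=0)

# Convert to the record layout used in the question.
records = [
    {"date": ts.strftime("%Y-%m-%dT%H:%M:%SZ"), "result": int(value)}
    for ts, value in filled.items()
]
```

Skipping `asfreq` and reindexing once against the full range avoids the need for the `(result or 0)` workaround, at least for data that already sits on the bucket boundaries.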

Setting up the data

import pandas as pd
data = [    { "date": "2020-01-16T00:00:00.000Z", "result": 3 }, 
            { "date": "2020-01-18T00:00:00.000Z", "result": 2 }, 
            { "date": "2020-01-13T00:00:00Z", "result": 1, "name": "Yes" }, 
            { "date": "2020-01-14T00:00:00Z", "result": 1, "name": "No" }, 
            { "date": "2020-01-16T00:00:00Z", "result": 1, "name": "No" }]

# build dataframe
df = pd.DataFrame(data)
# rows from the first example have no name; they are lumped in with "No" here
df["name"] = df["name"].fillna("No")
df["date"] = pd.to_datetime(df["date"])

Then process the data

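# set up date range (hourly here, matching the "60min" resample rule used below;
# for the daily buckets in the question, use "D" in both places)
idx = pd.date_range(df.date.min(), df.date.max(), freq="60min")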

# resample yes/no for name separately
df = df.set_index(["name", "date"]).sort_index()

no = df.loc["No"].resample( rule="60min").sum().reset_index()
no["Name"] = ["No"] * len(no)
no.set_index( ["Name", "date"], inplace=True)

yes = df.loc["Yes"].resample( rule="60min").sum().reset_index()
yes["Name"] = ["Yes"] * len(yes)
yes.set_index( ["Name", "date"], inplace=True)

# reindex with the full date range
yes = yes.reindex(pd.MultiIndex.from_arrays([["Yes"]*len(idx), idx], names=('Name', 'date')), fill_value=0)
no = no.reindex(pd.MultiIndex.from_arrays([["No"]*len(idx), idx], names=('Name', 'date')), fill_value=0)

# merge and create output (dateformat has to be adjusted)
df = pd.concat( [yes, no], axis=0)
df.reset_index().to_dict('records')

Result

[{'Name': 'Yes',
  'date': Timestamp('2020-01-13 00:00:00+0000', tz='UTC'),
  'result': 1},
 {'Name': 'Yes',
  'date': Timestamp('2020-01-13 01:00:00+0000', tz='UTC'),
  'result': 0}, ....
]
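
A more compact variant, as a sketch only: build the full (date, name) grid with `pd.MultiIndex.from_product` and reindex once, rather than handling "Yes" and "No" separately. The function name `fill_missing` and its parameters are hypothetical, daily buckets are assumed, and the rows are assumed to already be aggregated onto bucket boundaries (as the sample data in the question is).

```python
import pandas as pd

# Sample rows from the question's second (multi-series) case.
rows = [
    {"date": "2020-01-13T00:00:00Z", "result": 1, "name": "Yes"},
    {"date": "2020-01-14T00:00:00Z", "result": 1, "name": "No"},
    {"date": "2020-01-16T00:00:00Z", "result": 1, "name": "No"},
]


def fill_missing(rows, start, end, freq="D", names=None):
    """Return one record per (date, name) pair, with 0 where no row exists."""
    df = pd.DataFrame(rows)
    df["date"] = pd.to_datetime(df["date"], utc=True)
    if names is None:
        # Default to the names seen in the data, in order of first appearance.
        names = df["name"].drop_duplicates().tolist()

    # Every bucket in the requested range crossed with every name.
    dates = pd.date_range(start, end, freq=freq, tz="UTC")
    full = pd.MultiIndex.from_product([dates, names], names=["date", "name"])

    out = (
        df.groupby(["date", "name"])["result"].sum()
        .reindex(full, fill_value=0)  # missing (date, name) pairs become 0
        .reset_index()
    )
    # Format timestamps as ISO 8601 strings with a trailing Z.
    out["date"] = out["date"].dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    return out[["date", "result", "name"]].to_dict("records")


records = fill_missing(rows, "2020-01-13", "2020-01-16")
```

Passing `start` and `end` explicitly covers the case where the range comes from the user rather than from the data, and passing `names` explicitly would cover options that have no rows at all in the range. For weekly or monthly buckets the data would need to be aligned to the bucket boundaries first, which this sketch does not attempt.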

PerJensen answered May 23 '26 00:05
