Sending large dictionary via API call breaks development server

I am running a Django app with a PostgreSQL database and I am trying to send a very large dictionary (consisting of time-series data) to the database.

My goal is to write my data into the DB as fast as possible. I am using the requests library to send the data via an API call (built with Django REST Framework).

My API-view is simple:

@api_view(["POST"])
def CreateDummy(request):

    for elem, ts in request.data['time_series'].items():
        TimeSeries.objects.create(data_json=ts)

    msg = {"detail": "Created successfully"}
    return Response(msg, status=status.HTTP_201_CREATED)

request.data['time_series'] is a huge dictionary structured like this:

{Building1: {1:123, 2: 345, 4:567 .... 31536000: 2345}, .... Building30: {..... }}

That means I have 30 keys, and each of the 30 values is itself a dict with 31,536,000 elements.
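For a rough sense of scale, here is a back-of-the-envelope estimate of the JSON payload size (the bytes-per-pair figure is just an assumption):

n_buildings = 30
n_samples = 31_536_000             # one value per second for a year
bytes_per_pair = 10                # assumed average JSON text per '"key":value,' pair
approx_payload_gb = n_buildings * n_samples * bytes_per_pair / 1e9
print(approx_payload_gb)           # roughly 9.5 GB of JSON on the wire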

My API request looks like this (where data is my dictionary described above):

payload = {
    "time_series": data,
}

requests.request(
    "post", url=endpoint, json=payload
)

The code saves the time-series data to a jsonb field in the backend. Now that works if I only loop over the first 4 elements of the dictionary; I can get that data in within about 1 minute. But when I loop over the whole dict, my development server shuts down. I guess it's because the memory is insufficient. I get a requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')). Is the whole dict loaded into memory before it starts iterating? I doubt it, because I read that in Python 3 looping with .items() returns an iterator and is the preferred way to do this.

Is there a better way to deal with massive dicts in Django/Python? Should I loop through half of it and then through the other half? Or is there a faster way? Maybe using pandas? Or maybe sending the data differently? I guess I am looking for the most performant way to do this.

Happy to provide more code if needed.

Any help, hints or guides are very much appreciated! Thanks in advance

EDIT2: I think it is not my RAM usage or the size of the dict. I still have 5 GiB of RAM left when the server shuts down. ~~And the size of the dict is 1176 bytes~~ The dict is much larger, see comments.

EDIT3: I can't even print the huge dict; the server also shuts down then.

EDIT4: When I split the data up and don't send it all at once, the server can handle it (see the sketch below). But when I try to query it back, the server breaks again. It breaks on my production server (nginx/AWS RDS setup) and it breaks on my local dev server. I am pretty sure it's because Django can't handle queries that big with my current setup. But how could I solve this?
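The splitting I mean looks roughly like this, one request per building instead of one giant payload (endpoint and data as in the snippets above):

import requests

# send each building's time-series as its own request
for building, ts in data.items():
    payload = {"time_series": {building: ts}}
    requests.request("post", url=endpoint, json=payload)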

EDIT5: So what I am looking for is a two-part solution: one for the creation of the data and one for the querying of the data. The creation of the data I described above. But even if I get all that data into the database, I will still have problems getting it out again.

I tried creating the data not all at once but each time-series on its own. So let's assume I have this huge data set in my DB and I try to query it back. All time-series objects belong to a network, so I tried it like this:


class TimeSeriesByTypeAndCreationMethod(ListAPIView):
    """Query time-series in specific network."""

    serializer_class = TimeSeriesSerializer

    def get_queryset(self):
        """Query time-series

        Query by name of network, type of data, creation method and
        source.
        """

        network = self.kwargs["name_network"]

        if TimeSeries.objects.filter(
            network_element__network__name=network,
        ).exists():
            time_series = TimeSeries.objects.filter(
                network_element__network__name=network,
            )
            return time_series
        else:
            raise NotFound()

But the query breaks the server just like the data creation before. I think this is also too much data to load at once. I thought I could use raw SQL to avoid breaking the server... Or is there a better way here as well?

EDIT6: Relevant models:


class TimeSeries(models.Model):

    TYPE_DATA_CHOICES = [
        ....many choices...
    ]

    CREATION_METHOD_CHOICES = [
        ....many choices...
    ]

    description = models.CharField(
        max_length=120,
        null=True,
        blank=True,
    )

    network_element = models.ForeignKey(
        Building,
        on_delete=models.CASCADE,
        null=True,
        blank=True,
    )
    type_data = models.CharField(
        null=True,
        blank=True,
        max_length=30,
        choices=TYPE_DATA_CHOICES,
    )

    creation_method = models.CharField(
        null=True,
        blank=True,
        max_length=30,
        choices=CREATION_METHOD_CHOICES,
    )

    source = models.CharField(
        null=True,
        blank=True,
        max_length=300
    )

    data_json = JSONField(
        help_text="Data for time series in JSON format. Valid JSON expected."
    )

    creation_date = models.DateTimeField(auto_now=True, null=True, blank=True)

    def __str__(self):
        return f"{self.creation_method}:{self.type_data}"



class Building(models.Model):

    USAGE_CHOICES = [
        ...
    ]

    name = models.CharField(
        max_length=120,
        null=True,
        blank=True,
    )
    street = models.CharField(
        max_length=120,
        null=True,
        blank=True,
    )
    house_number = models.CharField(
        max_length=20,
        null=True,
        blank=True,
    )
    zip_code = models.CharField(
        max_length=5,
        null=True,
        blank=True,
    )
    city = models.CharField(
        max_length=120,
        null=True,
        blank=True,
    )
    usage = models.CharField(
        max_length=120,
        choices=USAGE_CHOICES,
        null=True,
        blank=True,
    )
    .....many more fields....
   
asked Oct 13 '20 by Micromegas


2 Answers

You can solve your issues using two techniques.

Data Creation

Use bulk_create to insert a large number of records; if an SQL error occurs due to the large query size, pass batch_size to bulk_create.

records = []
for elem, ts in request.data['time_series'].items():
    records.append(
        TimeSeries(data_json=ts)
    )

# setting batch size to 1000

TimeSeries.objects.bulk_create(records, batch_size=1000)

There are some caveats with bulk_create, e.g. it will not send model signals; see more in the bulk_create documentation.

Data Retrieval

Configure Django REST Framework to use pagination. Default configuration:

REST_FRAMEWORK = {
    'DEFAULT_PAGINATION_CLASS': 'rest_framework.pagination.LimitOffsetPagination',
    'PAGE_SIZE': 100
}
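With LimitOffsetPagination enabled, each response carries next/previous links, so a client can pull the data page by page instead of all at once. A minimal sketch (endpoint is assumed to be the list URL of your view):

import requests

results = []
url = f"{endpoint}?limit=100&offset=0"
while url:
    page = requests.get(url).json()
    results.extend(page["results"])
    url = page["next"]             # None after the last page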

For a custom configuration, use:

class TimeSeriesResultsSetPagination(PageNumberPagination):
    page_size = 50
    page_size_query_param = 'page_size'
    max_page_size = 10000

class BillingRecordsView(generics.ListAPIView):
    serializer_class = TimeSeriesSerializer
    pagination_class = TimeSeriesResultsSetPagination

    def get_queryset(self):
        """Query time-series

        Query by name of network, type of data, creation method and
        source.
        """

        network = self.kwargs["name_network"]

        if TimeSeries.objects.filter(
            network_element__network__name=network,
        ).exists():
            time_series = TimeSeries.objects.filter(
                network_element__network__name=network,
            )
            return time_series
        else:
            raise NotFound()

See other techniques for pagination at https://www.django-rest-framework.org/api-guide/pagination/
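For very large tables, cursor pagination can also be worth a look, since it avoids expensive large OFFSET queries. A sketch, assuming an indexed and mostly stable field such as creation_date to order by:

from rest_framework.pagination import CursorPagination

class TimeSeriesCursorPagination(CursorPagination):
    page_size = 50
    ordering = "creation_date"     # assumes creation_date is indexed and rarely changes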

answered Nov 16 '22 by sonus21


@micromegas while your solution is theoretically correct, calling create() many times in a loop is, I believe, what causes the ConnectionError exception.

Try refactoring to something like:

big_data_holder = []
for elem, ts in request.data['time_series'].items():
    big_data_holder.append(
        TimeSeries(data_json=ts)
    )

# examine the structure 
print(big_data_holder) 

TimeSeries.objects.bulk_create(big_data_holder)

Please check the downsides of this method in the Django docs for bulk_create.

answered Nov 16 '22 by Ahmed Shehab