I am running a Django app with a PostgreSQL database and I am trying to send a very large dictionary (consisting of time-series data) to the database.
My goal is to write my data into the DB as fast as possible. I am using the requests library to send the data via an API call (built with Django REST Framework):
My API view is simple:
@api_view(["POST"])
def CreateDummy(request):
for elem, ts in request.data['time_series'] :
TimeSeries.objects.create(data_json=ts)
msg = {"detail": "Created successfully"}
return Response(msg, status=status.HTTP_201_CREATED)
request.data['time_series'] is a huge dictionary structured like this:
{Building1: {1:123, 2: 345, 4:567 .... 31536000: 2345}, .... Building30: {..... }}
That means I have 30 keys (one per building), and each value is itself a dict with 31,536,000 entries (one per second of a year), so roughly 950 million data points in total.
My API request looks like this (where data is my dictionary described above):
payload = {
    "time_series": data,
}
requests.request(
    "post", url=endpoint, json=payload
)
The code saves the time-series data to a jsonb field in the backend. That works if I only loop over the first 4 elements of the dictionary; I can write that data in about one minute. But when I loop over the whole dict, my development server shuts down. I guess the memory is insufficient. I get a requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')).
Is the whole dict loaded into memory before it starts iterating? I doubt it, because I read that in Python 3 looping with .items() returns a view (not a copy) and is the preferred way to do this.
Is there a better way to deal with massive dicts in Django/Python? Should I loop through half of it and then through the other half? Or is there a faster way, maybe using pandas? Or maybe sending the data differently? I guess I am looking for the most performant way to do this.
Happy to provide more code if needed.
Any help, hints or guides are very much appreciated! Thanks in advance
EDIT2: I think it is not my RAM usage or the size of the dict. I still have 5 GiB of RAM left when the server shuts down. ~~And the size of the dict is 1176 bytes~~ The dict is much larger, see comments
EDIT3: I can't even print the huge dict; the server also shuts down then.
EDIT4: When I split the data up and don't send it all at once, the server can handle it (see the sketch below). But when I try to query it back, the server breaks again. It breaks on my production server (nginx, AWS RDS setup) and it breaks on my local dev server. I am pretty sure it's because Django can't handle queries that big with my current setup. But how could I solve this?
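To illustrate what I mean by splitting the data up, here is a minimal sketch; sending one building per request is just the chunk size I chose, and data and endpoint are the same names as above:

import requests

# data is the {Building1: {...}, ..., Building30: {...}} dict described above,
# endpoint is the API URL used earlier. One building per request keeps each
# payload small enough for the server to handle.
for building, ts in data.items():
    payload = {"time_series": {building: ts}}
    response = requests.post(endpoint, json=payload)
    response.raise_for_status()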
EDIT5: So what I am looking for is a two-part solution: one for creating the data and one for querying it. The creation of the data I described above. But even if I get all that data into the database, I will still have problems getting it out again.
I tried this by creating the data not all at once but one time series at a time. So let's assume I have all this huge data in my DB and I try to query it back. All time-series objects belong to a network, so I tried it like this:
class TimeSeriesByTypeAndCreationMethod(ListAPIView):
    """Query time-series in specific network."""

    serializer_class = TimeSeriesSerializer

    def get_queryset(self):
        """Query time-series

        Query by name of network, type of data, creation method and
        source.
        """
        network = self.kwargs["name_network"]

        if TimeSeries.objects.filter(
            network_element__network__name=network,
        ).exists():
            time_series = TimeSeries.objects.filter(
                network_element__network__name=network,
            )
            return time_series

        else:
            raise NotFound()
But the query breaks the server just like the data creation did before. I think this is also too much data to load at once. I thought I could use raw SQL to avoid breaking the server... Or is there also a better way?
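One alternative I have in mind (just a sketch, untested on data of this size) is to stream the rows with QuerySet.iterator(), which by default uses a server-side cursor on PostgreSQL so the whole result set is not loaded at once; the chunk_size below is a guess:

# Sketch only: stream rows instead of materialising the whole queryset.
# chunk_size would need tuning, 100 rows of JSON this large may already be a lot.
def iter_network_time_series(network_name):
    qs = TimeSeries.objects.filter(
        network_element__network__name=network_name,
    )
    for ts in qs.iterator(chunk_size=100):
        yield ts.data_json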
EDIT6: Relevant models:
class TimeSeries(models.Model):

    TYPE_DATA_CHOICES = [
        ....many choices...
    ]

    CREATION_METHOD_CHOICES = [
        ....many choices...
    ]

    description = models.CharField(
        max_length=120,
        null=True,
        blank=True,
    )

    network_element = models.ForeignKey(
        Building,
        on_delete=models.CASCADE,
        null=True,
        blank=True,
    )

    type_data = models.CharField(
        null=True,
        blank=True,
        max_length=30,
        choices=TYPE_DATA_CHOICES,
    )

    creation_method = models.CharField(
        null=True,
        blank=True,
        max_length=30,
        choices=CREATION_METHOD_CHOICES,
    )

    source = models.CharField(
        null=True,
        blank=True,
        max_length=300
    )

    data_json = JSONField(
        help_text="Data for time series in JSON format. Valid JSON expected."
    )

    creation_date = models.DateTimeField(auto_now=True, null=True, blank=True)

    def __str__(self):
        return f"{self.creation_method}:{self.type_data}"
class Building(models.Model):

    USAGE_CHOICES = [
        ...
    ]

    name = models.CharField(
        max_length=120,
        null=True,
        blank=True,
    )

    street = models.CharField(
        max_length=120,
        null=True,
        blank=True,
    )

    house_number = models.CharField(
        max_length=20,
        null=True,
        blank=True,
    )

    zip_code = models.CharField(
        max_length=5,
        null=True,
        blank=True,
    )

    city = models.CharField(
        max_length=120,
        null=True,
        blank=True,
    )

    usage = models.CharField(
        max_length=120,
        choices=USAGE_CHOICES,
        null=True,
        blank=True,
    )

    .....many more fields....
You can solve your issues using two techniques.
Use bulk_create to insert a large number of records. If an SQL error happens because the query gets too large, etc., then provide the batch_size argument to bulk_create.
records = []
for elem, ts in request.data['time_series'].items():
    records.append(
        TimeSeries(data_json=ts)
    )

# setting batch size to 1000
TimeSeries.objects.bulk_create(records, batch_size=1000)
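A follow-up note: batch_size controls how many objects go into each INSERT query, so it trades the number of round trips against the size of each individual query. With JSON values as large as yours, you may need a much smaller batch size than 1000; the right number depends on your data and database limits.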
There are some caveats with bulk_create, e.g. it will not send model signals; see more in the Django docs.
Configure REST framework to use pagination. Default configuration:
REST_FRAMEWORK = {
    'DEFAULT_PAGINATION_CLASS': 'rest_framework.pagination.LimitOffsetPagination',
    'PAGE_SIZE': 100
}
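With this enabled, list endpoints return at most PAGE_SIZE objects per response, wrapped in a body with count, next, previous and results keys, so the client fetches the data page by page instead of in one huge response (see the sketch further below).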
For a custom configuration, use something like this:
class TimeSeriesResultsSetPagination(PageNumberPagination):
    page_size = 50
    page_size_query_param = 'page_size'
    max_page_size = 10000


class BillingRecordsView(generics.ListAPIView):
    serializer_class = TimeSeriesSerializer
    pagination_class = TimeSeriesResultsSetPagination

    def get_queryset(self):
        """Query time-series

        Query by name of network, type of data, creation method and
        source.
        """
        network = self.kwargs["name_network"]

        if TimeSeries.objects.filter(
            network_element__network__name=network,
        ).exists():
            time_series = TimeSeries.objects.filter(
                network_element__network__name=network,
            )
            return time_series

        else:
            raise NotFound()
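On the client side you could then walk the pages by following the next links. A minimal sketch (the URL is hypothetical; adjust it to your actual urls.py, which is not shown in the question):

import requests

def fetch_all_time_series(url):
    # Collect all pages from the paginated endpoint by following the `next`
    # links in the response body. `url` is the list endpoint for the network,
    # e.g. something like https://example.com/api/time-series/<name_network>/
    results = []
    while url:
        response = requests.get(url)
        response.raise_for_status()
        page = response.json()
        results.extend(page["results"])
        url = page["next"]  # None on the last page
    return results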
See other techniques for pagination at https://www.django-rest-framework.org/api-guide/pagination/
@micromegas, while your solution is theoretically correct, I believe calling create() many times in a loop is what causes the ConnectionError exception.
Try refactoring to something like this:
big_data_holder = []
for elem, ts in request.data['time_series'].items():
    big_data_holder.append(
        TimeSeries(data_json=ts)
    )

# examine the structure
print(big_data_holder)

TimeSeries.objects.bulk_create(big_data_holder)
Please check the Django docs on bulk_create for some downsides of this method.