I have a list of named tuples. Each named tuple is a DataPoint
type I have created, which looks like this:
class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float
At various points throughout my code, I have to get all the DataPoints
in the list by a particular attribute. Here's how I do it for analysis_date; I have similar functions for the other attributes:
def get_data_points_on_date(self, data_points, analysis_date):
    data_on_date = []
    for data_point in data_points:
        if data_point.analysis_date == analysis_date:
            data_on_date.append(data_point)
    return data_on_date
This is called >100,000 times on lists with thousands of points, so it is slowing down my script significantly.
Instead of a list, I could do a dictionary for a significant speedup, but because I need to search on multiple attributes, there isn't an obvious key. I would probably choose the attribute whose lookup function is taking up the most time (in this case, analysis_date), and use that as the key. However, this would add significant complexity to my code. Is there anything besides hashing / a clever way to hash that is escaping me?
Perhaps an in-memory SQLite database (with column indexes) could help. It even has a way to map rows to named tuples, as Mapping result rows to namedtuple in python sqlite describes.
For a more complete solution, see, for example, http://peter-hoffmann.com/2010/python-sqlite-namedtuple-factory.html.
A basic example based on the two links above:
from typing import NamedTuple
from datetime import datetime
import sqlite3


class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float


def datapoint_factory(cursor, row):
    return DataPoint(*row)


def get_data_points_on_date(cursor, analysis_date):
    # Parameterized query; the index below keeps this from scanning the table.
    cursor.execute(
        "select * from datapoints where analysis_date = ?", (analysis_date,)
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute(
    "create table datapoints "
    "(data real, location_zone real, analysis_date text, error real)"
)
c.execute(
    "create index if not exists analysis_date_index on datapoints (analysis_date)"
)

timestamp = datetime.now().isoformat()
data_points = [
    DataPoint(data=0.5, location_zone=0.1, analysis_date=timestamp, error=0.0)
]
for data_point in data_points:
    c.execute("insert into datapoints values (?, ?, ?, ?)", tuple(data_point))
conn.commit()
c.close()

conn.row_factory = datapoint_factory
c = conn.cursor()
print(get_data_points_on_date(c, timestamp))
# [DataPoint(data=0.5, location_zone=0.1, analysis_date='2019-07-19T20:37:38.309668', error=0.0)]
c.close()
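As a hedged follow-on (not part of the original answer), the same approach extends to the other attributes the question needs to search on: add an index per column and query it with another parameterized select. The snippet below continues the example above, reopening a cursor on the same in-memory conn after its final c.close():
# Continuation of the example above (assumes conn is still open and has
# datapoint_factory set as its row_factory, so rows come back as DataPoints).
c = conn.cursor()
c.execute("create index if not exists location_zone_index on datapoints (location_zone)")
c.execute("select * from datapoints where location_zone = ?", (0.1,))
print(c.fetchall())  # the single sample DataPoint again, found via the zone index
c.close()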
You are right that you want to avoid doing what is essentially a linear search 100,000 times if the data can be pre-computed once. Why not use multiple dictionaries, each keyed by a different attribute of interest?
Each dictionary would be pre-computed once:
self.by_date = defaultdict(list)
for point in data_points:
    self.by_date[point.analysis_date].append(point)
Now your get_data_points_for_date function becomes a one-liner:
def get_data_points_for_date(self, date):
    return self.by_date[date]
You could probably remove this method entirely, and just use self.by_date[date] instead.
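One small behavioral detail, sketched below with hypothetical names: indexing a defaultdict(list) with a date that has no points returns an empty list, just as the original loop did, but it also inserts that key into the dictionary; dict.get(date, []) gives the same result without the side effect.
from collections import defaultdict
from datetime import datetime

by_date = defaultdict(list)  # empty index, purely for illustration
missing = datetime(2019, 7, 20)

print(by_date[missing])          # [] -- matches the old function, but stores the key
print(by_date.get(missing, []))  # [] -- no insertion side effect
print(len(by_date))              # 1, from the indexed lookup above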
This does not increase the complexity of your code, but it does transfer some of the book-keeping burden up front. You could handle that by having a set_data method that pre-computes all the dictionaries you want:
from collections import defaultdict
from operator import attrgetter

def set_data(self, data_points):
    def make_dict(key):
        d = defaultdict(list)
        for point in data_points:
            d[key(point)].append(point)
        return d

    self.by_date = make_dict(attrgetter('analysis_date'))
    self.by_zone = make_dict(self.zone_code)

def zone_code(self, data_point):
    return int(data_point.location_zone // 0.01)
Something like zone_code is necessary to convert floats to integers, since it is not a good idea to rely on floats as keys.
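For completeness, a minimal usage sketch of the two pre-computed indexes; the DataStore wrapper class and the sample values are hypothetical, everything else follows the answer above.
from collections import defaultdict
from datetime import datetime
from operator import attrgetter
from typing import NamedTuple


class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float


class DataStore:
    # Hypothetical container for the methods shown in the answer.
    def set_data(self, data_points):
        def make_dict(key):
            d = defaultdict(list)
            for point in data_points:
                d[key(point)].append(point)
            return d

        self.by_date = make_dict(attrgetter('analysis_date'))
        self.by_zone = make_dict(self.zone_code)

    def zone_code(self, data_point):
        return int(data_point.location_zone // 0.01)


date = datetime(2019, 7, 19)
points = [DataPoint(0.5, 0.1, date, 0.0), DataPoint(0.7, 0.1, date, 0.0)]

store = DataStore()
store.set_data(points)  # build every index once, up front

print(store.by_date[date])                        # both points, O(1) lookup
print(store.by_zone[store.zone_code(points[0])])  # both points, keyed by zone bucket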