
Fastest way to search a list of named tuples?

I have a list of named tuples. Each named tuple is a DataPoint type I have created, that looks like this:

class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float

At various points throughout my code, I have to get all the DataPoints in the list that match a particular attribute value. Here's how I do it for analysis_date; I have similar functions for the other attributes:

def get_data_points_on_date(self, data_points, analysis_date):
    data_on_date = []
    for data_point in data_points:
        if data_point.analysis_date == analysis_date:
            data_on_date.append(data_point)
    return data_on_date

This is called >100,000 times on lists with thousands of points, so it is slowing down my script significantly.

Instead of a list, I could use a dictionary for a significant speedup, but because I need to search on multiple attributes, there isn't an obvious key. I would probably choose the attribute whose lookups take the most time (in this case, analysis_date) and use that as the key. However, this would add significant complexity to my code. Is there anything besides hashing / a clever way to hash that is escaping me?
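For concreteness, a minimal sketch of the dictionary idea described above (the sample points are made up for illustration):

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float


points = [
    DataPoint(1.0, 0.1, datetime(2019, 7, 19), 0.0),
    DataPoint(2.0, 0.2, datetime(2019, 7, 20), 0.0),
    DataPoint(3.0, 0.3, datetime(2019, 7, 19), 0.0),
]

# Build the index once; each subsequent date lookup is O(1)
# instead of a full scan over the list.
by_date = defaultdict(list)
for p in points:
    by_date[p.analysis_date].append(p)

print(len(by_date[datetime(2019, 7, 19)]))  # 2
```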

P.V. asked Jul 19 '19 17:07



2 Answers

Perhaps an in-memory SQLite database (with column indexes) could help. It even has a way to map rows to named tuples, as "Mapping result rows to namedtuple in python sqlite" describes.

For a more complete solution, refer, for example, to http://peter-hoffmann.com/2010/python-sqlite-namedtuple-factory.html.


A basic example based on the two links above:

from typing import NamedTuple
from datetime import datetime
import sqlite3


class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float


def datapoint_factory(cursor, row):
    return DataPoint(*row)


def get_data_points_on_date(cursor, analysis_date):
    # Use a parameterized query rather than string interpolation,
    # which avoids quoting issues and SQL injection.
    cursor.execute(
        "select * from datapoints where analysis_date = ?",
        (analysis_date,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute(
    "create table datapoints "
    "(data real, location_zone real, analysis_date text, error real)"
)
c.execute(
    "create index if not exists analysis_date_index on datapoints (analysis_date)"
)


timestamp = datetime.now().isoformat()
data_points = [
    DataPoint(data=0.5, location_zone=0.1, analysis_date=timestamp, error=0.0)
]

for data_point in data_points:
    # Parameterized insert; the namedtuple unpacks directly as the row.
    c.execute("insert into datapoints values (?, ?, ?, ?)", data_point)

conn.commit()
c.close()

conn.row_factory = datapoint_factory
c = conn.cursor()

print(get_data_points_on_date(c, timestamp))
# [DataPoint(data=0.5, location_zone=0.1, analysis_date='2019-07-19T20:37:38.309668', error=0.0)]
c.close()
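As a side note (not part of the original answer), loading the thousands of points mentioned in the question can be done in one call with `executemany`, which is both safer and faster than per-row string formatting. The table name and values here are illustrative:

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute(
    "create table datapoints "
    "(data real, location_zone real, analysis_date text, error real)"
)

date = datetime(2019, 7, 19).isoformat()
rows = [(float(i), 0.1, date, 0.0) for i in range(1000)]

# One round trip through the SQL layer for all 1000 rows.
c.executemany("insert into datapoints values (?, ?, ?, ?)", rows)
conn.commit()

c.execute(
    "select count(*) from datapoints where analysis_date = ?", (date,)
)
print(c.fetchone()[0])  # 1000
```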
Dušan Maďar answered Sep 25 '22 19:09


You are right that you want to avoid doing what is essentially a linear search 100,000 times if the data can be pre-computed once. Why not use multiple dictionaries, each keyed by a different attribute of interest?

Each dictionary would be pre-computed once:

self.by_date = defaultdict(list)
for point in data_points:
    self.by_date[point.analysis_date].append(point)

Now your get_data_points_for_date function becomes a one-liner:

def get_data_points_for_date(self, date):
    return self.by_date[date]

You could probably remove this method entirely, and just use self.by_date[date] instead.

This does not increase the complexity of your code, but it does transfer some of the book-keeping burden up front. You could handle that by having a set_data method that pre-computes all the dictionaries you want:

from collections import defaultdict
from operator import attrgetter

def set_data(self, data_points):
    def make_dict(key):
        d = defaultdict(list)
        for point in data_points:
            d[key(point)].append(point)
        return d

    self.by_date = make_dict(attrgetter('analysis_date'))
    self.by_zone = make_dict(self.zone_code)

def zone_code(self, data_point):
    return int(data_point.location_zone // 0.01)

A helper like zone_code is needed to convert floats to integers, since relying on exact float values as dictionary keys is fragile.
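As a quick illustration (a standalone variant of the answer's zone_code that takes the raw float instead of a data point), nearby values collapse to the same integer bucket while distant ones do not:

```python
def zone_code(location_zone):
    # Bucket the float into 0.01-wide bins; the resulting int
    # is an exact, hashable dictionary key.
    return int(location_zone // 0.01)

# Nearby floats land in the same bucket; distant ones don't.
print(zone_code(0.123), zone_code(0.1234), zone_code(0.25))
```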

Mad Physicist answered Sep 24 '22 19:09