I have a list of named tuples. Each named tuple is a DataPoint
type I have created, which looks like this:
class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float
At various points throughout my code, I have to get all the DataPoints
in the list by a particular attribute. Here's how I do it for analysis_date; I have similar functions for the other attributes:
def get_data_points_on_date(self, data_points, analysis_date):
    data_on_date = []
    for data_point in data_points:
        if data_point.analysis_date == analysis_date:
            data_on_date.append(data_point)
    return data_on_date
This is called >100,000 times on lists with thousands of points, so it is slowing down my script significantly.
Instead of a list, I could do a dictionary for a significant speedup, but because I need to search on multiple attributes, there isn't an obvious key. I would probably choose the attribute whose lookup function is taking up the most time (in this case, analysis_date), and use that as the key. However, this would add significant complexity to my code. Is there anything besides hashing / a clever way to hash that is escaping me?
Perhaps an in-memory SQLite database (with column indexes) could help. It even has a way to map rows to named tuples, as Mapping result rows to namedtuple in python sqlite describes.
For a more complete solution, see, for example, http://peter-hoffmann.com/2010/python-sqlite-namedtuple-factory.html.
A basic example based on the two links above:
from typing import NamedTuple
from datetime import datetime
import sqlite3


class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float


def datapoint_factory(cursor, row):
    return DataPoint(*row)


def get_data_points_on_date(cursor, analysis_date):
    # Parameterized query; the index below keeps this from scanning the table.
    cursor.execute(
        "select * from datapoints where analysis_date = ?", (analysis_date,)
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute(
    "create table datapoints "
    "(data real, location_zone real, analysis_date text, error real)"
)
c.execute(
    "create index if not exists analysis_date_index on datapoints (analysis_date)"
)

timestamp = datetime.now().isoformat()
data_points = [
    DataPoint(data=0.5, location_zone=0.1, analysis_date=timestamp, error=0.0)
]
for data_point in data_points:
    c.execute("insert into datapoints values (?, ?, ?, ?)", tuple(data_point))
conn.commit()
c.close()

conn.row_factory = datapoint_factory
c = conn.cursor()
print(get_data_points_on_date(c, timestamp))
# [DataPoint(data=0.5, location_zone=0.1, analysis_date='2019-07-19T20:37:38.309668', error=0.0)]
c.close()
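As a hedged follow-on (not part of the original answer), the same approach extends to the other attributes the question needs to search on: add an index per column and query it with another parameterized select. The snippet below continues the example above, reopening a cursor on the same in-memory conn after its final c.close():
# Continuation of the example above (assumes conn is still open and has
# datapoint_factory set as its row_factory, so rows come back as DataPoints).
c = conn.cursor()
c.execute("create index if not exists location_zone_index on datapoints (location_zone)")
c.execute("select * from datapoints where location_zone = ?", (0.1,))
print(c.fetchall())  # the single sample DataPoint again, found via the zone index
c.close()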
You are right that you want to avoid doing what is essentially a linear search 100,000 times if the data can be pre-computed once. Why not use multiple dictionaries, each keyed by a different attribute of interest?
Each dictionary would be pre-computed once:
self.by_date = defaultdict(list)
for point in data_points:
    self.by_date[point.analysis_date].append(point)
Now your get_data_points_for_date function becomes a one-liner:
def get_data_points_for_date(self, date):
    return self.by_date[date]
You could probably remove this method entirely, and just use self.by_date[date] instead.
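One small behavioral detail, sketched below with hypothetical names: indexing a defaultdict(list) with a date that has no points returns an empty list, just as the original loop did, but it also inserts that key into the dictionary; dict.get(date, []) gives the same result without the side effect.
from collections import defaultdict
from datetime import datetime

by_date = defaultdict(list)  # empty index, purely for illustration
missing = datetime(2019, 7, 20)

print(by_date[missing])          # [] -- matches the old function, but stores the key
print(by_date.get(missing, []))  # [] -- no insertion side effect
print(len(by_date))              # 1, from the indexed lookup above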
This does not increase the complexity of your code, but it does transfer some of the book-keeping burden up front. You could handle that by having a set_data method that pre-computes all the dictionaries you want:
from collections import defaultdict
from operator import attrgetter

def set_data(self, data_points):
    def make_dict(key):
        d = defaultdict(list)
        for point in data_points:
            d[key(point)].append(point)
        return d

    self.by_date = make_dict(attrgetter('analysis_date'))
    self.by_zone = make_dict(self.zone_code)

def zone_code(self, data_point):
    return int(data_point.location_zone // 0.01)
Something like zone_code is necessary to convert floats to integers, since it is not a good idea to rely on floats as keys.
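For completeness, a minimal usage sketch of the two pre-computed indexes; the DataStore wrapper class and the sample values are hypothetical, everything else follows the answer above.
from collections import defaultdict
from datetime import datetime
from operator import attrgetter
from typing import NamedTuple


class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float


class DataStore:
    # Hypothetical container for the methods shown in the answer.
    def set_data(self, data_points):
        def make_dict(key):
            d = defaultdict(list)
            for point in data_points:
                d[key(point)].append(point)
            return d

        self.by_date = make_dict(attrgetter('analysis_date'))
        self.by_zone = make_dict(self.zone_code)

    def zone_code(self, data_point):
        return int(data_point.location_zone // 0.01)


date = datetime(2019, 7, 19)
points = [DataPoint(0.5, 0.1, date, 0.0), DataPoint(0.7, 0.1, date, 0.0)]

store = DataStore()
store.set_data(points)  # build every index once, up front

print(store.by_date[date])                        # both points, O(1) lookup
print(store.by_zone[store.zone_code(points[0])])  # both points, keyed by zone bucket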