My scenario is as follows: I have a table of data (a handful of fields, less than a hundred rows) that I use extensively in my program. I also need this data to be persistent, so I save it as a CSV and load it on start-up. I choose not to use a database because every option (even SQLite) is overkill for my humble requirement (also, I would like to be able to edit the values offline in a simple way, and nothing is simpler than Notepad).
Assume my data looks as follows (in the file it's comma separated without titles, this is just an illustration):
Row | Name     | Year | Priority
------------------------------------
1   | Cat      | 1998 | 1
2   | Fish     | 1998 | 2
3   | Dog      | 1999 | 1
4   | Aardvark | 2000 | 1
5   | Wallaby  | 2000 | 1
6   | Zebra    | 2001 | 3
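For reference, here is a minimal sketch of the load/save round trip I have in mind (the file name and helper names are just illustrative, not my actual code):

import csv

FILENAME = 'table.csv'  # illustrative name; the real file is comma separated without titles

def load_table():
    # Each row comes back as a list of strings
    with open(FILENAME, newline='') as fp:
        return list(csv.reader(fp))

def save_table(rows):
    with open(FILENAME, 'w', newline='') as fp:
        csv.writer(fp).writerows(rows)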
Things I do with the data: look up a row by its ID or its name; display the table ordered by different combinations of fields (e.g. by Priority and then Year, or by Year and then Priority); and count rows that match ranges of values (e.g. how many rows have a Year between 1997 and 2002, or a Year of 1998 with a Priority above 2).
I know this "cries" for SQL...
I'm trying to figure out the best choice of data structure. Here are several options I see:
List of row lists:
a = []
a.append([1, "Cat", 1998, 1])
a.append([2, "Fish", 1998, 2])
a.append([3, "Dog", 1999, 1])
...
List of column lists (there will obviously be an API for add_row etc):
a = []
a.append([1, 2, 3, 4, 5, 6])
a.append(["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"])
a.append([1998, 1998, 1999, 2000, 2000, 2001])
a.append([1, 2, 1, 1, 1, 3])
Dictionary of column lists (constants can be created to replace the string keys):
a = {}
a['ID'] = [1, 2, 3, 4, 5, 6]
a['Name'] = ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"]
a['Year'] = [1998, 1998, 1999, 2000, 2000, 2001]
a['Priority'] = [1, 2, 1, 1, 1, 3]
Dictionary with keys being tuples of (Row, Field):
# Create constants to avoid string searching
NAME = 1
YEAR = 2
PRIORITY = 3

a = {}
a[(1, NAME)] = "Cat"
a[(1, YEAR)] = 1998
a[(1, PRIORITY)] = 1
a[(2, NAME)] = "Fish"
a[(2, YEAR)] = 1998
a[(2, PRIORITY)] = 2
...
And I'm sure there are other ways... However, each of these has disadvantages when it comes to my requirements (complex ordering and counting), as the sketch below illustrates.
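For instance, with the list-of-row-lists option, the ordering and counting have to be spelled out by hand. A sketch (the column-index constants are mine, not part of any of the layouts above):

# Hypothetical column indices for the list-of-row-lists layout
ID, NAME, YEAR, PRIORITY = 0, 1, 2, 3

a = [[1, "Cat", 1998, 1],
     [2, "Fish", 1998, 2],
     [3, "Dog", 1999, 1]]

# Order by Priority, then Year
by_priority_then_year = sorted(a, key=lambda r: (r[PRIORITY], r[YEAR]))

# Count rows with Year between 1997 and 2002
count = sum(1 for r in a if 1997 <= r[YEAR] <= 2002)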
What's the recommended approach?
EDIT:
To clarify, performance is not a major issue for me. Because the table is so small, I believe almost every operation will be in the range of milliseconds, which is not a concern for my application.
Having a "table" in memory that needs lookups, sorting, and arbitrary aggregation really does call out for SQL. You said you tried SQLite, but did you realize that SQLite can use an in-memory-only database?
connection = sqlite3.connect(':memory:')
Then you can create/drop/query/update tables in memory, with all the functionality of SQLite and no files left over when you're done. And as of Python 2.5, sqlite3 is in the standard library, so it's not really "overkill" IMO.
Here is a sample of how one might create and populate the database:
import csv
import sqlite3

db = sqlite3.connect(':memory:')

def init_db(cur):
    cur.execute('''CREATE TABLE foo (
        Row INTEGER,
        Name TEXT,
        Year INTEGER,
        Priority INTEGER)''')

def populate_db(cur, csv_fp):
    rdr = csv.reader(csv_fp)
    cur.executemany('''
        INSERT INTO foo (Row, Name, Year, Priority)
        VALUES (?,?,?,?)''', rdr)

cur = db.cursor()
init_db(cur)
populate_db(cur, open('my_csv_input_file.csv'))
db.commit()
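Once the table is populated, the lookups, ordering, and counting from the question are one query each. A sketch continuing from the cur and foo above (these queries are my illustration, not part of the original sample):

# Look up a row by Name
cur.execute('SELECT * FROM foo WHERE Name = ?', ('Aardvark',))
print(cur.fetchone())

# Display the table ordered by Priority, then Year
for row in cur.execute('SELECT * FROM foo ORDER BY Priority, Year'):
    print(row)

# Count rows with Year in a given range
cur.execute('SELECT COUNT(*) FROM foo WHERE Year BETWEEN 1997 AND 2002')
print(cur.fetchone()[0])

(Because the columns were declared INTEGER, SQLite's type affinity converts the numeric-looking strings coming out of csv.reader on insert, so range comparisons like BETWEEN behave numerically.)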
If you'd really prefer not to use SQL, you should probably use a list of dictionaries:
import csv

lod = []  # "list of dicts"

def populate_lod(lod, csv_fp):
    rdr = csv.DictReader(csv_fp, ['Row', 'Name', 'Year', 'Priority'])
    lod.extend(rdr)

def query_lod(lod, filter=None, sort_keys=None):
    if filter is not None:
        lod = (r for r in lod if filter(r))
    if sort_keys is not None:
        lod = sorted(lod, key=lambda r: [r[k] for k in sort_keys])
    else:
        lod = list(lod)
    return lod

def lookup_lod(lod, **kw):
    for row in lod:
        for k, v in kw.iteritems():
            if row[k] != str(v):
                break
        else:
            return row
    return None
Testing then yields:
>>> lod = []
>>> populate_lod(lod, csv_fp)
>>>
>>> pprint(lookup_lod(lod, Row=1))
{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'}
>>> pprint(lookup_lod(lod, Name='Aardvark'))
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'}
>>> pprint(query_lod(lod, sort_keys=('Priority', 'Year')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
 {'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
 {'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
 {'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
 {'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
 {'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> pprint(query_lod(lod, sort_keys=('Year', 'Priority')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
 {'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
 {'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
 {'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
 {'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
 {'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> print len(query_lod(lod, lambda r: 1997 <= int(r['Year']) <= 2002))
6
>>> print len(query_lod(lod, lambda r: int(r['Year']) == 1998 and int(r['Priority']) > 2))
0
Personally, I like the SQLite version better, since it preserves your types (without extra conversion code in Python) and easily grows to accommodate future requirements. But then again, I'm quite comfortable with SQL, so YMMV.
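To make that "extra conversion code" concrete: csv.DictReader hands every field back as a string (as the output above shows), so the list-of-dicts version would need a hypothetical helper along these lines:

def coerce_types(lod):
    # DictReader yields all values as strings; convert the
    # numeric columns back to ints by hand.
    for row in lod:
        for key in ('Row', 'Year', 'Priority'):
            row[key] = int(row[key])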
A very old question, I know, but...
A pandas DataFrame seems to be the ideal option here.
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.html
From the blurb:

"Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure."
http://pandas.pydata.org/
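As a rough sketch of the same workflow in pandas (the column names are supplied by hand since the CSV has no header row, and sort_values is the current name of the sorting API):

import pandas as pd

# No header row in the file, so supply the column names explicitly
df = pd.read_csv('my_csv_input_file.csv',
                 names=['Row', 'Name', 'Year', 'Priority'])

# Look up a row by Name
print(df[df['Name'] == 'Aardvark'])

# Display the table ordered by Priority, then Year
print(df.sort_values(['Priority', 'Year']))

# Count rows with Year in a given range
print(len(df[df['Year'].between(1997, 2002)]))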