Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most appropriate data structure (Python)

I'm new to Python and have what is probably a very basic question about the 'best' way to store data in my code. Any advice much appreciated!

I have a long .csv file in the following format:

Scenario,Year,Month,Value
1,1961,1,0.5
1,1961,2,0.7
1,1961,3,0.2
etc.

My scenario values run from 1 to 100, year goes from 1961 to 1990 and month goes from 1 to 12. My file therefore has 100*29*12 = 34800 rows, each with an associated value.

I'd like to read this file into some kind of Python data structure so that I can access a 'Value' by specifying the 'Scenario', 'Year' and 'Month'. What's the best way to do this please (or what are the various options)?

In my head I think of this data as a kind of 'number cuboid' with axes for Scenario, Year and Month, so that each Value is located at co-ordinates (Scenario, Year, Month). For this reason, I'm tempted to try to read these values into a 3D numpy array and use Scenario, Year and Month as indices. Is this a sensible thing to do?

I guess I could also make a dictionary where the keys are something like

str(Scenario)+str(Year)+str(Month)

Would this be better? Are there other options?

(By 'better' I suppose I mean 'faster to access', although if one method is much less memory intensive than another it'd be good to know about that too).

Thanks very much!

like image 905
JamesS Avatar asked Nov 12 '10 11:11

JamesS


2 Answers

I'd use a dict of tuples. Simple, fast, and a hash-table look-up to retrieve a single value:

import csv

reader = csv.reader(open('data.csv', 'rb'))
header = reader.next()
data = {}

for row in reader:
    key = tuple([int(v) for v in row[:-1]])
    val = row[-1]
    data[key] = float(val)

# Retrieve a value
print data[1, 1961, 3]
like image 71
fmark Avatar answered Oct 04 '22 01:10

fmark


I would use sqlite3 for storing the data to disk. You'll be able to read in the full data set or subsets through SQL queries. You can then load that data into a numpy array or other Python data structure -- whatever is most convenient for the task.

If you do choose to use sqlite, also note that sqlite has a TIMESTAMP data type. It may be a good idea to combine the year and month into one TIMESTAMP. When you read TIMESTAMPs into Python, sqlite3 can be told to automatically convert the TIMESTAMPs into datetime.datetime objects, which would reduce some of the boilerplate code you'd otherwise have to write. It will also make it easier to form SQL queries which ask for all the rows between two dates.

like image 25
unutbu Avatar answered Oct 04 '22 01:10

unutbu