I'm new to Python and have what is probably a very basic question about the 'best' way to store data in my code. Any advice much appreciated!
I have a long .csv file in the following format:
Scenario,Year,Month,Value
1,1961,1,0.5
1,1961,2,0.7
1,1961,3,0.2
etc.
My scenario values run from 1 to 100, year goes from 1961 to 1990 and month goes from 1 to 12. My file therefore has 100*29*12 = 34800 rows, each with an associated value.
I'd like to read this file into some kind of Python data structure so that I can access a 'Value' by specifying the 'Scenario', 'Year' and 'Month'. What's the best way to do this please (or what are the various options)?
In my head I think of this data as a kind of 'number cuboid' with axes for Scenario, Year and Month, so that each Value is located at co-ordinates (Scenario, Year, Month). For this reason, I'm tempted to try to read these values into a 3D numpy array and use Scenario, Year and Month as indices. Is this a sensible thing to do?
I guess I could also make a dictionary where the keys are something like
str(Scenario)+str(Year)+str(Month)
Would this be better? Are there other options?
(By 'better' I suppose I mean 'faster to access', although if one method is much less memory intensive than another it'd be good to know about that too).
Thanks very much!
I'd use a dict of tuples. Simple, fast, and a hash-table look-up to retrieve a single value:
import csv
reader = csv.reader(open('data.csv', 'rb'))
header = reader.next()
data = {}
for row in reader:
key = tuple([int(v) for v in row[:-1]])
val = row[-1]
data[key] = float(val)
# Retrieve a value
print data[1, 1961, 3]
I would use sqlite3 for storing the data to disk. You'll be able to read in the full data set or subsets through SQL queries. You can then load that data into a numpy array or other Python data structure -- whatever is most convenient for the task.
If you do choose to use sqlite, also note that sqlite has a TIMESTAMP data type.
It may be a good idea to combine the year and month into one TIMESTAMP. When you read TIMESTAMPs into Python, sqlite3
can be told to automatically convert the TIMESTAMPs into datetime.datetime
objects, which would reduce some of the boilerplate code you'd otherwise have to write. It will also make it easier to form SQL queries which ask for all the rows between two dates.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With