Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Storing and reloading large multidimensional data sets in Python

I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:

My data will basically take the form of a multidimensional array where each entry will look something like this:

data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]

Each argument has roughly the following numbers of potential values:

stringArg1: 50

stringArg2: 20

stringArg3: 6

stringArg4: 24

intArg1: 10,000

Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.

So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.

I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:

  1. a relational database

  2. PyTables

  3. Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)

There's one issue I run into in all three approaches, I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table, or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200] , [300, 400]] you would retrieve values by asking for the value at an address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.

What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...

That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.

I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.

Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.

like image 940
dbb Avatar asked Feb 20 '11 21:02

dbb


People also ask

How does Python store multiple dimensional data?

Multidimensional Array concept can be explained as a technique of defining and storing the data on a format with more than two dimensions (2D). In Python, Multidimensional Array can be implemented by fitting in a list function inside another list function, which is basically a nesting operation for the list function.

What are multi dimensional datasets?

Multidimensional data is a data set with many different columns, also called features or attributes. The more columns in the data set, the more likely you are to discover hidden insights. In this case, two-dimensional analysis falls flat. Think of this data as being in a cube on multiple planes.

What is multidimensional data analysis?

Multidimensional data analysis is a method that, according to the management information demands, users observe the data from several angles of the real world and make a corresponding processing so as to get some useful information. Multidimensional data analysis cannot do without the foundation of data warehouse.


1 Answers

Why not using a big table for keep all the 500 millions of entries? If you use on-the-flight compression (Blosc compressor recommended here), most of the duplicated entries will be deduped, so the overhead in storage is kept under a minimum. I'd recommend give this a try; sometimes the simple solution works best ;-)

like image 199
FrancescAlted Avatar answered Sep 20 '22 23:09

FrancescAlted