Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to store a sparse queryable matrix on disk or database?

I need to store a sparse matrix on disk. It is like a database table with millions of rows and thousands of columns, where many or most columns are null. It needs to be queryable, like a SQL SELECT with a WHERE on some of the columns.

My specific requirement is on Java. I first thought of using Berkeley DB for Java to simulate a table, but then it does not support querying based on values.

Then, I thought about using a regular SQL database. For example, creating a schema with only a Row ID, a Column ID, and the value. The virtual row will be all the actual rows with the same ID. But then, this looks like database abuse.

Any ideas?

like image 284
mparaz Avatar asked Feb 28 '09 14:02

mparaz


3 Answers

The first thing that came to my mind when reading the question heading was a database row per (x,y) as you suggested in your next to last paragraph.

The other thing to note is that databases often compress the rows, particularly for NULLs, so the straightforward representation may not waste as much space as you think.

like image 62
Doug Currie Avatar answered Sep 28 '22 18:09

Doug Currie


It depends on your definition of "many or most columns are null", but that sounds like a very reasonable approach, assuming that you really need the random access.

If you can do everything by sequential processing (e.g. a scan in row order) then a flat file would be another reasonable option to consider.

like image 44
joel.neely Avatar answered Sep 28 '22 17:09

joel.neely


the Intersystems Cache database uses structures internally for storing data, which are sparse multidimensional arrays. Maybe check that out. You can query it, and map it to SQL tables. I'm not sure if you can directly access the multidimensional arrays in Intersystems Cache from java though.

like image 38
Raymond Roestenburg Avatar answered Sep 28 '22 16:09

Raymond Roestenburg