I need to store a sparse matrix on disk. It is like a database table with millions of rows and thousands of columns, where many or most columns are null. It needs to be queryable, like a SQL SELECT with a WHERE on some of the columns.
My specific requirement is on Java. I first thought of using Berkeley DB for Java to simulate a table, but then it does not support querying based on values.
Then, I thought about using a regular SQL database. For example, creating a schema with only a Row ID, a Column ID, and the value. The virtual row will be all the actual rows with the same ID. But then, this looks like database abuse.
Any ideas?
The first thing that came to my mind when reading the question heading was a database row per (x,y) as you suggested in your next to last paragraph.
The other thing to note is that databases often compress the rows, particularly for NULLs, so the straightforward representation may not waste as much space as you think.
It depends on your definition of "many or most columns are null", but that sounds like a very reasonable approach, assuming that you really need the random access.
If you can do everything by sequential processing (e.g. a scan in row order) then a flat file would be another reasonable option to consider.
the Intersystems Cache database uses structures internally for storing data, which are sparse multidimensional arrays. Maybe check that out. You can query it, and map it to SQL tables. I'm not sure if you can directly access the multidimensional arrays in Intersystems Cache from java though.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With