Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java large datastructure for storing a matrix

I need to store a 2d matrix containing zip codes and the distance in km between each one of them. My client has an application that calculates the distances which are then stored in an Excel file. Currently, there are 952 places. So the matrix would have 952x952 = 906304 entries.

I tried to map this into a HashMap[Integer, Float]. The Integer is the hash code of the two Strings for two places, e.g. "A" and "B". The float value is the distance in km between them.

While filling in the data I run into OutOfMemoryExceptions after 205k entries. Do you have a tip how I can store this in a clever way? I even don't know if it's clever to have the whole bunch in memory. My options are SQL and MS Access...

The problem is that I need to access the data very quickly and possibly very often which is why I chose the HashMap because it runs in O(1) for the look up.

Thansk for your replies and suggestions!

Marco

like image 691
Marco Avatar asked Nov 10 '09 22:11

Marco


People also ask

Which data structure is best for storing large data?

Which among the following data structures is best suited for storing very large numbers (numbers that cannot be stored in long long int). Following are the operations needed for these large numbers. Explanation: The only two choices that make sense are Array and Linked List.

Which data structure is used to implement a matrix?

Arrays are used to implement mathematical vectors and matrices, as well as other kinds of rectangular tables.

Which data structure should I use to store 10000 elements?

So, if you have only 10000 nodes to store, you should choose a compact representation (if possible) and store this representation directly in an array on the stack.

What are the three types of structures used to store the strings?

There are three ways to store strings: Fixed length (array type structure) Variable length but maximum size is fixed during running time (pointer type structure) Linked list structure.


2 Answers

A 2d array would be more memory efficient. You can use a small hashmap to map the 952 places into a number between 0 and 951 . Then, just do:

float[][] distances= new float[952][952];

To look things up, just use two hash lookups to convert the two places into two integers, and use them as indexes into the 2d array.

By doing it this way, you avoid the boxing of floats, and also the memory overhead of the large hashmap.

However, 906304 really isn't that many entries, you may just need to increase the Xmx maximum heap size

like image 128
Chi Avatar answered Oct 27 '22 00:10

Chi


I would have thought that you could calculate the distances on the fly. Presumably someone has already done this, so you simply need to find out what algorithm they used, and the input data; e.g. longitude/latitude of the notional centres of each ZIP code.

EDIT: There are two commonly used algorithms for finding the (approximate) geodesic distance between two points given by longitude/latitude pairs.

  • The Vicenty formula is based on an ellipsoid approximation. It is more accurate, but more complicated to implement.

  • The Haversine formula is based on a spherical approximation. It is less accurate (0.3%), but simpler to implement.

like image 28
Stephen C Avatar answered Oct 27 '22 00:10

Stephen C