
Efficiency of Guava Table vs multiple hash maps

I ran across some code that was doing something like this:

Map<String,String> fullNameById = buildMap1(dataSource1);
Map<String,String> nameById = buildMap2(dataSource2);
Map<String,String> nameByFullName = new HashMap<String,String>();
Map<String,String> idByName = new HashMap<String,String>();

Set<String> ids = fullNameById.keySet();
for (String nextId : ids) {
  String name = nameById.get(nextId);
  String fullName = fullNameById.get(nextId);
  nameByFullName.put(fullName, name);
  idByName.put(name, nextId);
}

I had to stare at it for several minutes to figure out what was going on. All of that amounts to a join operation on ids and an inversion of one of the original maps. Since Id, FullName, and Name are always 1:1:1, it seemed to me that there should be some way to simplify this. I also discovered that the first two maps are never used again, and I find the above code a bit hard to read. So I'm considering replacing it with something like this, which (to me) reads much more cleanly:

Table<String, String, String> relations = HashBasedTable.create();

addRelationships1(dataSource1, relations);
addRelationships2(dataSource2, relations);

Map<String,String> idByName = relations.column("hasId");
Map<String,String> nameByFullName = relations.column("hasName");
relations = null; // not used hereafter

In addRelationships1 I do

relations.put(id, "hasFullName", fullname);

And in addRelationships2 where my query yields values for id and name I do

relations.put(relations.remove(id,"hasFullName"), "hasName", name);
relations.put(name, "hasId", id);

So my questions are these:

  1. Is there a lurking inefficiency in what I have done, whether in processor time, memory, or GC load? I don't think so, but I'm not that familiar with the efficiency of Table. I am aware that the Table object won't be GC'd after relations = null; I just want to communicate that it's not used again in the rather lengthy section of code that follows.
  2. Have I gained any efficiency? I keep convincing and unconvincing myself that I have and have not.
  3. Do you find this more readable? Or is this only easy for me to read because I wrote it? I'm a tad worried on that front due to the fact that Table is not well known. On the other hand, the top level now pretty clearly says, "gather data from two sources and make these two maps from it." I also like that it doesn't leave you wondering if/where the other two maps are being used (or not).
  4. Do you have an even better, cleaner, faster, simpler way to do it than either of the above?

Please, let's not have the optimize-early/late discussion here. I'm well aware of that pitfall. If it improves readability without hurting performance, I am satisfied. A performance gain would be a nice bonus.

Note: my variable and method names have been sanitized here to keep the business area from distracting from the discussion; I definitely won't name them addRelationships1 or dataSource1! Similarly, the final code will of course use constants, not raw strings.

Gus, asked Mar 01 '13 at 19:03


2 Answers

So I did some mini-benchmarking myself and came to the conclusion that there is little difference between the two methods in execution time. I kept the total size of the data being processed constant by trading run count for data-set size. I did 4 runs and chose the lowest time for each implementation across all 4. Reassuringly, both implementations were always fastest on the same run. My code can be found here. Here are my results:

Case                      Maps (ms)   Table (ms)    Table vs Maps
100000 runs of size 10    2931        3035          104%
10000 runs of size 100    2989        3033          101%
1000 runs of size 1000    3129        3160          101%
100 runs of size 10000    4126        4429          107%
10 runs of size 100000    5081        5866          115%
1 run  of size 1000000    5489        5160          94%

So using Table seems to be slightly slower for small data sets. Something interesting happens around 100,000, and then by 1 million the table is actually faster. My data will hang out in the 100 to 1000 range, so at least in execution time the performance should be nearly identical.
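The methodology above boils down to a tiny harness shape like this (an illustrative sketch, not the exact benchmark code linked above):

```java
// Sketch of the benchmark shape: hold total work roughly constant by
// trading run count against data-set size, and keep the best time from
// several rounds to reduce noise from JIT warm-up and GC pauses.
class Bench {
    // Run the task `rounds` times and return the fastest elapsed time in ns.
    static long bestOf(int rounds, Runnable task) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }
}
```

Taking the minimum rather than the mean is a deliberate choice here: the fastest round is the one least disturbed by the JVM's background work.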

As for readability, my opinion is that someone reading the code just to figure out what is happening nearby will find the intent significantly easier to see. If they actually have to debug this bit of code, it may be a bit harder, since Table is less common and requires some sophistication to understand.

Another thing I am unsure of is whether it's more efficient to create the hash maps or to just query the table directly, in the case where all keys of the map will subsequently be iterated. However, that's a different question :)

And the comedic ending is that as I analyzed the code further (hundreds of lines), I found that the only significant use of nameByFullName.get() outside of logging (of questionable value) was to pass its result to idByName.get(). So in the end I'll actually build an idByFullName map and an idByName map instead, with no need for any joining, and drop the whole table idea anyway. But it made for an interesting SO question, I guess.
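That final shape is simple enough to sketch: each map is just an inversion of its source, built independently, with no join step at all (the class and method names here are stand-ins, assuming the sources still arrive as id-keyed maps):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the simplified end state: no join needed. Each source map
// is inverted on its own into the map that is actually used downstream.
class DirectMaps {
    // Invert an id-keyed map, e.g. id -> fullName becomes fullName -> id.
    // Safe here only because the relationships are 1:1:1, as stated above.
    static Map<String, String> invert(Map<String, String> byId) {
        Map<String, String> inverted = new HashMap<>();
        for (Map.Entry<String, String> e : byId.entrySet()) {
            inverted.put(e.getValue(), e.getKey());
        }
        return inverted;
    }
}
```

So idByFullName becomes invert(fullNameById) and idByName becomes invert(nameById), and nothing downstream is left wondering where the other maps went.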

Gus, answered Nov 11 '22 at 23:11


tl;dr, but I'm afraid you'd need to take a bigger step away from the original design. Simulating DB tables might be a nice exercise, but to me your code isn't really readable.

  1. Is there a lurking inefficiency in what I have done? No idea.
  2. Have I gained any efficiency? I'm afraid you'd need to measure it first. Removing some indirection surely helps, but using a more complicated data structure might offset that. Performance in general is simply too complicated to predict.
  3. Do you find this more readable? I'm afraid not.
  4. Do you have an even better, cleaner, faster, simpler way to do it than either of the above? I hope so...

Where I get lost in such code is the use of strings for everything: it's just too easy to pass a wrong string as an argument. So I'd suggest aggregating them into an object and providing maps for accessing the objects via any part of them. Something as trivial as this should do:

class IdNameAndFullName {
    String id, name, fullName;
}

class IdNameAndFullNameMaps {
    Map<String, IdNameAndFullName> byId;
    Map<String, IdNameAndFullName> byName;
    Map<String, IdNameAndFullName> byFullName;
}
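Fleshed out with a constructor and an indexing method, that sketch might look like this (the `register` name and the inline field initializers are just illustrative choices):

```java
import java.util.HashMap;
import java.util.Map;

// One record holds all three parts, so a lookup by any part yields the
// other two without any joining or map inversion.
class IdNameAndFullName {
    final String id, name, fullName;

    IdNameAndFullName(String id, String name, String fullName) {
        this.id = id;
        this.name = name;
        this.fullName = fullName;
    }
}

class IdNameAndFullNameMaps {
    final Map<String, IdNameAndFullName> byId = new HashMap<>();
    final Map<String, IdNameAndFullName> byName = new HashMap<>();
    final Map<String, IdNameAndFullName> byFullName = new HashMap<>();

    // Index one record under each of its parts (assumes 1:1:1, as the
    // question states).
    void register(IdNameAndFullName record) {
        byId.put(record.id, record);
        byName.put(record.name, record);
        byFullName.put(record.fullName, record);
    }
}
```

With this shape, mixing up which string means what becomes a compile-time field-name question rather than a runtime key-string question.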

You could obviously replace the class IdNameAndFullNameMaps with a Table. However, besides using a nice pre-existing data structure, I see no advantage in doing so. The disadvantages are:

  • loss of efficiency
  • loss of readability (I wouldn't use Table here for the very same reason Tuple should be avoided)
  • use of String keys (your "hasId" and "hasName").
maaartinus, answered Nov 11 '22 at 21:11