Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recursive data processing performance using Java and SQLite

If you have an answer that is not Java / SQLite related, I'd be pleased to read it.

The environment

I store items in a database with the following scheme :

###################
#       Item      #    
###################
#      _id        #    This is the primary key
#    parent_id    #    If set, it the ID of the item containing this item
#      date       #    An ordinary date
#  geocontext_id  #    Foreign key to a pair of named coordinates
###################

###################
#   Geocontext    #    
###################
#       _id       #    This is the primary key
#       name      #    Way for the user to label a pair of coordinates (e.g : "home", "work")
#         x       #    One of the coordinate
#         y       #    The other one
###################

The problem

I must filter the items according to the geocontext and the date. It would be a easy job if items were all at the same level, but the trick is it's recursive. E.G :

root
      |_item 1
      |_item 2 
      |      |_item 4
      |      |_item 5
      |             |_item 6
      |_item 3
      |      |_item 8
      |             |_item 10
      |_item 11
      |       |_item 12
      |_item 7

There is no explicit limit to the recursive depth.

Now, if we are in any node and filter with the date "1st of April", we must see not only the items directly contained in the node that match the date, but we must see the items that contain items matching the date as well.

E.G : We are in "Items 2", if "item 6" matches the date, then we consider "item 5" matches the date too and we must keep it. If we are at the root, then item 2 must be displayed.

Same goes for the geocontext, but it's even harder because  :

  • It's stored in an other table.
  • Matching the context is a costly mathematical computation.

Of course, brute forcing the matching would result in the software to be slow and a very poor user experience.

NOTE : I don't need to display a tree. I display a list of filtered data from a tree. We must see only a flat list of the top elements. The challenge is to decide whether to display each element or not, according to all the children hierachy.

How I tried to solve it

I thought I could ease a bit the problem by using more tables to cache flat data :

###################
# Geocontex_cache #    
###################
#     item_id     #     I can Join the items table on this field
#     child_id    #     I can delete / update a child, and so delete / update the cache
#  geocontext_id  #     I can delete / update a geocontext, and so delete / update the cache
#        x        #      Here, I can brute force :-)
#        y        # 
###################

###################
#    Date_cache   #    
###################
#     item_id     #     
#     child_id    #    
#       date      #    
###################

This seems reasonable, but I haven't tried it yet. Nevertheless, it should the following drawbacks :

  • I moved the costly process to the get / set / create / delete methods that will have to manage the cached date. That will be a troublesome code to write and maintain. A five depth level item will triger a process that will hit recursively five parents.

  • The size ot the data base could become HUGE. A five de depth level item store cached data for five parents. Don't know if it's relevant, since this a a mono-user app with a manual input. I don't think anybody would insert more thatn 1000 items with more than 10 level of depth.

Now the good news is we go from the bottom of the pyramid to the top, not the other way, so it's not has horrible as it seems. When I will have to deal with parent item deletion, it will be another nice headache, but I save it for another question ;-).

Now my question

How would you store the data and process the filtering int the most optimal way ?

Optional :

Should I define an explicit recursive depth limit ? Should I perform the filtering using SQL or Java ? SQL surely will be faster, but matching the geocontext is far easier to do in Java.

As I am working on the Android platform, I have the following constraints :

  • Java is the only language available, and not with the entire standard lib.

  • SQLite is the only DBMS available.

  • Performance and memory are important issues. In case you have to choose, battery life and therefore performance is the priority.

  • Exotics external libs may not be able to be used.

P.S : I dug into SO and found some interesting pieces of information (espacially What is the most efficient/elegant way to parse a flat table into a tree?). It's a hint, but not a problem solver.

like image 950
e-satis Avatar asked Apr 04 '09 10:04

e-satis


1 Answers

1) First, let's look at simply putting everything in memory. This is simple, flexible, and above all, fast, solution. Drawbacks include the fact that you'll have to read everything into memory at startup (give the user a pretty loading bar and they won't even notice), and perhaps have to do a little extra work to ensure everything is reflected to disk when the user thinks it is, so that data isn't lost.

In this analysis I'm making some generic assumptions about Android/Dalvik I don't really know that much about, so hopefully it's somewhat accurate :) Remember the G1 has 192MB of RAM. Also, your assumption above was a max around 1000 items.

Object superclass ~ 8 bytes
parent/child pointer ~ 4 bytes
date (long) ~ 8 bytes
name (non interned string avg 32 chars) ~ 64 bytes
x point (int) ~ 4 bytes
y point (int) ~ 4 bytes

Total = 92 bytes + possible memory alignment + fudge factor = 128 bytes
1000 items = 125kB
10000 items = 1.22MB

Note: I realize that while a child can only have one pointer, a parent can have multiple children. However, the number of parent->child pointers is (elements - 1), so the average cost of parent->child pointer is (elements - 1)/elements ~ 1 element or 4 bytes. This assumes a child structure that doesn't allocate unused memory, such as a LinkedList (as opposed to an ArrayList)

2) The nerd in me says that this would be a fun place to profile a B+ tree, but I think that's overkill for what you want at the moment :) However, whatever solution you end up adopting, if you are not holding everything in memory, you will definitely want to cache as much of the top levels of the tree in memory as you can. This may cut down on the amount of disk activity drastically.

3) If you don't want to go all memory, another possible solution might be as follows. Bill Karwin suggests a rather elegant RDBMS structure called a Closure Table for optimizing tree based reads, while making writes more complex. Combining this with top level cache may give you performance benefits, although I would test this before taking my word on it:

When evaluating a view, use whatever you have in memory to evaluate as many children as you can. For those children that do not match, use an SQL join between the closure table and the flat table with an appropriate where clause to find out if there are any matching children. If so, you'll be displaying that node on your result list.

Hope this all makes sense and seems like it would work for what you need.

like image 57
sooniln Avatar answered Oct 22 '22 08:10

sooniln