Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are Mondrian / OLAP the wrong tool for joining large dimensions/sets?

Summary: Most of the examples I've seen of MDX joins have involved joining relatively small sets, say with tens or hundreds of items each. But I find myself also wanting to try joining (in particular "non-empty joining") sets that have thousands or tens of thousands of items each, and it's not working well so far. I'm wondering if this could be made to work, or if I perhaps need to consider using something other than Mondrian/OLAP.

To be concrete, I have a cube that records interactions between Firms (n=7000) and Clients (n=27000). Currently both Firm and Client are completely flat hierarchies; there's the All level and the individual-company level, with no other levels in between. There is a central fact table, and separate dimension tables for Firms and for Clients.

My users at least appear to want to get summary reports along these lines, aggregating all the non-empty interactions between Firms and Clients:

select
  [Measures].[Amount] on columns,
  NonEmptyCrossJoin([Firm].Children,
                      [Client].Children) on rows
from MyCube

But this query and variations on it don't work in my test Mondrian setup. Either I get an OutOfMemoryException (on a 2GB Java heap), or Java seems to spend impossibly long time in mondrian.rolap.RolapResult$AxisMember.mergeTuple(TupleCursor). (I can provide a more complete stack trace if it would help.) By "impossibly long" I mean Java will stay slaving away at the query for hours and hours before I give up.

I initially expected the above query to perform ok, because conceptually it could be done somewhat efficiently by just doing a SQL query along these lines:

select Firm, Client, Sum(Amount) as n
from fact, firm, client
where fact.firmid = firm.firmid and fact.clientid = client.clientid
group by Firm, Client

(In fact, if I execute something like this directly in MySql it doesn't take more than 15sec to execute.)

But from the debug logs Mondrian doesn't seem to attempt this optimization. Instead it appears to be doing the join internally, and in a way that ends up being particularly slow. I've set mondrian.native.crossjoin.enable=true in my mondrian.properties, but this doesn't seem like one of the join types that Mondrian is able to "make native". (If I turn on mondrian.native.unsupported.alert=ERROR then I get the corresponding exception.)

I'm left wondering whether I need to prevent my users from attempting joins on such large dimensions/sets, or whether Mondrian is maybe not the tool I'm looking for here. But maybe I'm just doing something wrong.

like image 420
Chris Avatar asked Nov 19 '11 00:11

Chris


3 Answers

To follow up, I tried setting up an analogous cube in Sql Server Analysis Services (Sql Server 2008), and it seems that icCube has a point about different OLAP tools performing differently:

Even before I learned much about SSAS best practices, performance on this type of MDX was much improved. A query along these lines

select
  [Measures].[Amount] on columns,
  NON EMPTY
  crossjoin([Firms].[Firm Name].Children,
            [Clients].[Client Name].Children)
  on rows
from MyCube

went from being non-viable with Mondrian to taking around ten seconds under Sql Server. Conceivably this has to do with MS' Business Intelligence Development Studio guiding me into creating a MOLAP cube by default, or perhaps SSAS has a smarter query planner.

In any case, perhaps this is fast enough for me. If not, I'm not yet sure how much more optimized SSAS can get in this case. (One disappointing thing is that, even when I re-run the query a second time, it still takes about 10 sec; I was hoping caching might have a more dramatic effect.)

Tangentially, you may notice in the just-quoted MDX I've replaced my original NonEmptyCrossJoin with a normal crossjoin combined with NON EMPTY. This is because, at least in the Sql Server world, NonEmptyCrossJoin is apparently regarded as a deprecated bad practice. (This is noted in Microsoft's MDX Language Reference. Mosha, one of the former SSAS devs, describes the situation in an article called MDX: NonEmpty, Exists and evil NonEmptyCrossJoin. Short version is that NonEmptyCrossJoin has confusing semantics and limited application, and since Sql Server 2005 or so, the query optimizer has been smart enough to make your query fast without NonEmptyCrossJoin.) So I've substituted a more modern approved equivalent in the above MDX. (It does still work with NonEmptyCrossJoin as well, though NonEmptyCrossJoin does not speed things up at all.)

like image 146
Chris Avatar answered Oct 29 '22 22:10

Chris


I'm not 100% sure, but have you tried setting:

mondrian.native.nonempty.enable = true

This optimisation seems to push some operations like that down to the sql level - Sounds like it could help.

like image 34
Codek Avatar answered Oct 29 '22 22:10

Codek


I'll answer the part of OLAP. There three big families of OLAP tools. ROLAP, MOLAP and HOLAP.

ROLAP, Relational, are build on a relation database. MDX request, if cache is missed, are performed in a relational database using a SQL statement. They have an advantage of scalability, by delagation, but depened for their performance on the underlying data base. QoS might be tricky as it's the db QoS.

MOLAP, InMemory, copy the data into internal structures (memory). Here the QoS, answer times, are more stable and faster as all processing is done in the same server. The issue with MOLAP is scalability as you might get out-of memory (>100mio).

HOLAP those are a mixed of ROLAP and MOLAP. I've no direct experience but in theory they can bring the best of both worlds.

Looking at the numbers you should not get any problem with MOLAP tools, it's really a small cube.

So, before leaving the OLAP world, give a chance to the MOLAP servers. For a list of OLAP servers you can check wikipedia

like image 43
ic3 Avatar answered Oct 29 '22 22:10

ic3