 

Apache Phoenix join query performance

I started using Apache Phoenix a couple of months ago. Below are the environment and version details.

Hadoop: Cloudera CDH 5.4.7-1
Phoenix: 4.3 (shipped as parcels on CDH 5.4.7-1)
HBase: 1.0.0
JDK: 1.7.0_67
Cluster: 1 master and 3 region servers

We started to do a POC to evaluate Apache Phoenix. We have data in 12 different tables on Oracle DB. We get the data into Hadoop system using Oracle golden gate.

There are 12 corresponding Phoenix tables, each with 40-100 columns and a few hundred rows. We run a transformation process and then load the results into a final table. This is basic ETL. The transformation goes through a couple of intermediate stages where we populate intermediate tables, so there are joins between tables.

Everything worked great and we were able to implement the entire ETL process. I was very happy with the ease of use and implementation.

The issues started when we began performance testing with millions of rows. Below are the issues.

  1. The intermediate transformation process crashes the region servers: I am joining two tables, each with 20 million rows, with a secondary index on the join column. Both tables are salted with 9 buckets. This performed well when the join produced fewer than ~200k rows, though even the 200k-row case takes more than 10 minutes to execute. If the result set is larger, the region servers start crashing. Test code:

     explain select count(available_bal)
     from salted.b_acct2 ba
     inner join (select c_item_id from salted.m_item2 mi where s_info_id = 12345) as mi
     on ba.c_item_id = mi.c_item_id;

    +------------------------------------------+
    | PLAN                                     |
    +------------------------------------------+
    | CLIENT 9-CHUNK PARALLEL 9-WAY FULL SCAN OVER SALTED.S2_BA_CI_IDX |
    |     SERVER AGGREGATE INTO SINGLE ROW |
    |     PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE) |
    |         CLIENT 9-CHUNK PARALLEL 9-WAY RANGE SCAN OVER SALTED.S_MI_SI_IDX [0,19,266] |
    |             CLIENT MERGE SORT |
    |     DYNAMIC SERVER FILTER BY TO_BIGINT("C_ITEM_ID") IN (MI.C_ITEM_ID) |
    +------------------------------------------+

  2. Joining 6 tables for the final transformation hangs: joining 6 tables on indexed columns completes in 10-12 minutes when the result is under 1 million rows. If the join result exceeds approximately 1 million rows, the query hangs and never returns. Initially I got an InsufficientMemoryException, which I resolved by changing the configuration to increase the available memory. I no longer get that exception, but the query now runs for more than 20 minutes without finishing. We expect it to execute within a few seconds.

Below are the parameters:

jvm.bootoptions: -Xms512m -Xmx9999M
hbase.regionserver.wal.codec: org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
hbase.rpc.timeout: 600000
phoenix.query.timeoutMs: 360000000
phoenix.query.maxServerCacheBytes: 10737418240
phoenix.coprocessor.maxServerCacheTimeToLiveMs: 900000
hbase.client.scanner.timeout.period: 600000
phoenix.query.maxGlobalMemoryWaitMs: 100000
phoenix.query.maxGlobalMemoryPercentage: 50
hbase.regionserver.handler.count: 50
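For context, Phoenix and HBase properties like the ones above are typically set in hbase-site.xml on the cluster nodes. A minimal sketch showing two of the entries (the values are the ones listed above; which properties must be set server-side vs. client-side varies, so treat this as illustrative):

```xml
<!-- Excerpt from hbase-site.xml (sketch, not a complete file) -->
<configuration>
  <property>
    <!-- Max bytes a server-side cache (e.g. a hash-join table) may occupy -->
    <name>phoenix.query.maxServerCacheBytes</name>
    <value>10737418240</value>
  </property>
  <property>
    <!-- Percentage of heap that Phoenix memory-manager tracked usage may consume -->
    <name>phoenix.query.maxGlobalMemoryPercentage</name>
    <value>50</value>
  </property>
</configuration>
```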

Summary: the core issues are slow execution of join queries and, eventually, crashing region servers when the data goes beyond 1 million rows. Is there a limitation on join execution? Please help me resolve these issues, as we are in the middle of an evaluation and I don't want to let go of Phoenix! If I can execute the above queries quickly, I would not hesitate to use Phoenix.

Regards, Shivamohan

Asked Oct 31 '22 by Shivamohan M P


1 Answer

By default, Phoenix uses hash joins, which require the data being joined to fit in memory. If you run into problems with very large tables, you can either increase the amount of memory allocated to Phoenix (via configuration) or add a query hint (e.g. SELECT /*+ USE_SORT_MERGE_JOIN */ ... FROM ...) to use a sort-merge join, which does not have the same memory requirement. The project plans to auto-select the ideal join algorithm in the future. Also note that Phoenix currently supports only a subset of join operations.
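Applied to the first query from the question, the hint would look something like this (a sketch using the table and column names from the question; not tested against your schema):

```sql
-- USE_SORT_MERGE_JOIN tells Phoenix to sort both inputs on the join key
-- and merge them, instead of building an in-memory hash table on the server.
SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(available_bal)
FROM salted.b_acct2 ba
INNER JOIN (
    SELECT c_item_id
    FROM salted.m_item2
    WHERE s_info_id = 12345
) mi
ON ba.c_item_id = mi.c_item_id;
```

You can run EXPLAIN on the hinted query to confirm the plan switches from a hash join (server cache) to a sort-merge join before committing to it for the full 20-million-row test.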

Answered Nov 11 '22 by kliew