Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to speed up a slow multi-join query in a small Vertica database (~120K total rows, 10 minutes)

I'd love to get your help understanding why this join-heavy query takes about 10 minutes to run on a small database of seven tables totaling < 120K rows, and ideally get your suggestions on how to make it faster on our little cluster of four nodes. I've put supporting information here: https://gist.github.com/anonymous/8862796 (list of tables, list of fields by table, and table sizes), but following is the query and the EXPLAIN VERBOSE output. I ran ANALYZE_WORKLOAD() on this query and then followed its suggestion to run ANALYZE_STATISTICS on all tables. This resulted in no improvement. I then did its second suggestion of running the Database Designer, which resulted in even slower performance. I'd very much appreciate your help.

Profile information

Thanks to the tips below re: PROFILE. I ran it and put the results here: https://gist.github.com/anonymous/8935190 . It is 8K lines long, so perhaps I didn't run it correctly (details in the gist). Question: How do I start analyzing it?

Query backstory

The reason the query is messy is mainly because it's dynamically generated for each run of our machine learning research software that has to walk paths graph-style through the E-R tables involved, applying various conditions. In this case the path is [rates, movie, rates, ml_user, rates, movie, rates]. Queries are progressively built up during the program's exploration of a solution space, which is why there (currently) aren't the optimizations folks have kindly and rightly suggested by @wumpz and @Bohemian below, e.g., eliminating subselects. This means I'm kinda' stuck with its current form for the short term :-/

 ------------------------------ 
 QUERY PLAN DESCRIPTION: 
 ------------------------------

 Opt Vertica Options
 --------------------
 PLAN_OUTPUT_SUPER_VERBOSE


 EXPLAIN VERBOSE
 SELECT relVarTable0.id AS id, relVarTable1.val, relVarTable2.val 
 FROM (SELECT id FROM rates) relVarTable0
 LEFT JOIN
 (SELECT rates1.id AS id, AVG(rates4.rating) AS val 
 FROM rates rates1, movie movie1, rates rates2, ml_user ml_user1, rates rates3, movie movie2, rates rates4 
 WHERE movie1.id = rates1.movie_id AND movie1.id = rates2.movie_id AND ml_user1.id = rates2.ml_user_id AND ml_user1.id = rates3.ml_user_id AND movie2.id = rates3.movie_id AND movie2.id = rates4.movie_id AND movie1.id <> movie2.id AND rates1.id <> rates2.id AND rates2.id <> rates3.id AND rates3.id <> rates4.id AND rates4.rating IS NOT NULL 
 GROUP BY rates1.id) relVarTable1
  ON relVarTable0.id = relVarTable1.id
 LEFT JOIN
 (SELECT rates1.id AS id, rates1.rating AS val 
 FROM rates rates1 
 WHERE rates1.rating IS NOT NULL ) relVarTable2
  ON relVarTable0.id = relVarTable2.id;

 Access Path:
 Sort Key: (V(1,1))
  LDISTRIB_UNSEGMENTED
 +-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4489.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5441368.000000 Memory(B): 1209184.000000 Netwrk(B): 1209184.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 1) Inner (RESEGMENT)
 |  Join Cond: (relVarTable0.id = relVarTable2.id)
 |  Execute on: All Nodes
 |  Sort Key: (V(1,1))
 |  LDISTRIB_UNSEGMENTED
 | +-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4197.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 1369200.000000 Memory(B): 0.000000 Netwrk(B): 604600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 2) Outer (RESEGMENT)
 | |      Join Cond: (relVarTable0.id = relVarTable1.id)
 | |      Execute on: All Nodes
 | |      Sort Key: (V(1,1))
 | |      LDISTRIB_UNSEGMENTED
 | | +-- Outer -> SELECT [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 3)
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | |      LDISTRIB_UNSEGMENTED
 | | | +---> STORAGE ACCESS for rates [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 4)
 | | | |      Column Cost Aspects: [ Disk(B): 196608.000000 CPU(B): 0.000000 Memory(B): 604600.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | |      Projection: movielens_test.rates_b0
 | | | |      Materialize: rates.id
 | | | |      Execute on: All Nodes
 | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | |      LDISTRIB_SEGMENTED
 | | +-- Inner -> SELECT [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 5)
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id)
 | | |      LDISTRIB_UNSEGMENTED
 | | | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 6650600.000000 Memory(B): 640000.000000 Netwrk(B): 6890600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 6)
 | | | |      Aggregates: sum_float(<SVAR>), count(<SVAR>)
 | | | |      Group By: rates1.id
 | | | |      Execute on: All Nodes
 | | | |      Sort Key: (rates.id)
 | | | |      LDISTRIB_SEGMENTED
 | | | | +---> JOIN HASH [Cost: 2869.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 12091944.000000 Memory(B): 3022960.000000 Netwrk(B): 1813776.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 88] (PATH ID: 7) Inner (RESEGMENT)
 | | | | |      Join Cond: (movie2.id = rates4.movie_id)
 | | | | |      Join Filter: (rates3.id <> rates4.id)
 | | | | |      Execute on: All Nodes
 | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | |      LDISTRIB_UNSEGMENTED
 | | | | | +-- Outer -> JOIN HASH [Cost: 2395.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 9110592.000000 Memory(B): 41592.000000 Netwrk(B): 4246064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 64] (PATH ID: 8) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | |      Join Cond: (movie2.id = rates3.movie_id)
 | | | | | |      Join Filter: (movie1.id <> movie2.id)
 | | | | | |      Execute on: All Nodes
 | | | | | |      Runtime Filter: (SIP1(HashJoin): movie2.id)
 | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | +-- Outer -> JOIN HASH [Cost: 1625.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 10278200.000000 Memory(B): 3023000.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 56] (PATH ID: 9) Inner (RESEGMENT)
 | | | | | | |      Join Cond: (ml_user1.id = rates3.ml_user_id)
 | | | | | | |      Join Filter: (rates2.id <> rates3.id)
 | | | | | | |      Execute on: All Nodes
 | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | |      LDISTRIB_UNSEGMENTED
 | | | | | | | +-- Outer -> JOIN HASH [Cost: 1163.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5582544.000000 Memory(B): 141144.000000 Netwrk(B): 2465448.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 10) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | | | |      Join Cond: (ml_user1.id = rates2.ml_user_id)
 | | | | | | | |      Execute on: All Nodes
 | | | | | | | |      Runtime Filter: (SIP2(HashJoin): ml_user1.id)
 | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | +-- Outer -> JOIN HASH [Cost: 711.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 8464400.000000 Memory(B): 2418400.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 32] (PATH ID: 11) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
 | | | | | | | | |      Join Cond: (movie1.id = rates2.movie_id)
 | | | | | | | | |      Join Filter: (rates1.id <> rates2.id)
 | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates2 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 12)
 | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | | | |      Materialize: rates2.id, rates2.ml_user_id, rates2.movie_id
 | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | +-- Inner -> JOIN HASH [Cost: 268.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 3064592.000000 Memory(B): 41592.000000 Netwrk(B): 1223064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 13) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | | | | | |      Join Cond: (movie1.id = rates1.movie_id)
 | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates1 [Cost: 39.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 14)
 | | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 0.000000 Memory(B): 1209200.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | | | | |      Materialize: rates1.id, rates1.movie_id
 | | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | | +-- Inner -> STORAGE ACCESS for movie1 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 15)
 | | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | | |      Projection: movielens_test.movie_b0
 | | | | | | | | | | |      Materialize: movie1.id
 | | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | | |      Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
 | | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | +-- Inner -> STORAGE ACCESS for ml_user1 [Cost: 5.000000, Rows: 5881.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 16)
 | | | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 47048.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | |      Projection: movielens_test.ml_user_b0
 | | | | | | | | |      Materialize: ml_user1.id
 | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | |      Sort Key: (ml_user.id, ml_user.gender, ml_user.age_range, ml_user.occupation, ml_user.zipcode, ml_user.num_ratings)
 | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | +-- Inner -> STORAGE ACCESS for rates3 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 17)
 | | | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | |      Materialize: rates3.id, rates3.ml_user_id, rates3.movie_id
 | | | | | | | |      Execute on: All Nodes
 | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | +-- Inner -> STORAGE ACCESS for movie2 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 18)
 | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | |      Projection: movielens_test.movie_b0
 | | | | | | |      Materialize: movie2.id
 | | | | | | |      Execute on: All Nodes
 | | | | | | |      Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
 | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | +-- Inner -> STORAGE ACCESS for rates4 [Cost: 60.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PUSHED GROUPING) Partial GroupBy: rates4.movie_id,rates4.id Partial Aggs: sum_float(<SVAR>),count(<SVAR>) (PATH ID: 19)
 | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 196608.000000 Memory(B): 1813784.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | |      Projection: movielens_test.rates_b0
 | | | | | |      Materialize: rates4.rating, rates4.id, rates4.movie_id
 | | | | | |      Filter: (rates4.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
 | | | | | |      Execute on: All Nodes
 | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | |      LDISTRIB_SEGMENTED
 | +-- Inner -> SELECT [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 20)
 | |      Execute on: All Nodes
 | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | |      LDISTRIB_UNSEGMENTED
 | | +---> STORAGE ACCESS for rates1 [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 21)
 | | |      Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 196608.000000 Memory(B): 1209184.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | |      Projection: movielens_test.rates_b0
 | | |      Materialize: rates1.rating, rates1.id
 | | |      Filter: (rates1.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | |      LDISTRIB_SEGMENTED


 ------------------------------ 
like image 274
Matthew Cornell Avatar asked Feb 13 '23 20:02

Matthew Cornell


2 Answers

First I see too many NO STATISTICS in your explain plan. That's a bad idea and you should fix it.

See the order of your tables in the join? A hash join was created and you're doing a full tbl scan on your biggest table. Fix this by doing hash join(small table join big table) and not (big table join small table).

  1. Run DBD
  2. Run analyze
  3. Run an explain on it and make sure projections are used to answer your query
  4. Check if your movielens_test.rates can be partitioned
    • If running o a single node(MPP) wont be used as that is a big win
    • Run profile on your query and post it
    • Make sure you have the right data encoding applied on your DDL as well the order of your order by cols to address better your used predicated (I think this will be done by DBD - I always check)

As a final note that I always do:

Open your database log and watch it as you run your query. If you have data spilled on disk this can be your problem as your sort data is bigger then your allocated memory.

Another option is for you to create pre-joined projection on your first sub-query. But only if your data does not suffer many data changes as pre-joined projections are very bad with data loading on them.

like image 92
Up_One Avatar answered Feb 17 '23 13:02

Up_One


Materialize: rates3.id, rates3.ml_user_id, rates3.movie_id -- i see this materialize happening on these column.

and run time filter is also not good, i can see many

Runtime Filter: (SIP2(HashJoin): ml_user1.id). try to use this in your where claus and check your projection being used in the query(you can check it from projection_usage table)

if you are joining two tables , please ensure projection are designed and segmented by hash on high cardinal columns and try to use them in your group by (if any).

if you see any RESGMENTED_MANY_ROWS. PLEASE USE SAME DATA TYPE COLUMNS IN YOUR PREDICATE/WHERE CLAUSE e.g. e.empid (number)=f.empid(number) data type can be integer but can't be float . (just an example) when you run this query please take the session id and transaction id and run below query select * from query_profiles where transaction_id<> and statement_id;

like image 32
user6932671 Avatar answered Feb 17 '23 13:02

user6932671