
How to optimize spark sql to run it in parallel

I am a Spark newbie and have a simple Spark application that uses Spark SQL/HiveContext to:

  1. select data from hive table (1 billion rows)
  2. do some filtering and aggregation, including row_number() over a window to select the first row, group by, count(), max(), etc.
  3. write the result into HBase (hundreds of millions of rows)

I submit the job to a YARN cluster (100 executors). It is slow, and when I look at the DAG Visualization in the Spark UI, it seems only the Hive table scan tasks run in parallel; the rest of steps #2 and #3 above appear to run in only one instance, which presumably should be possible to parallelize?

The application looks like:

Step 1:

val input = hiveContext
  .sql("""
    SELECT
        user_id
        , address
        , age
        , phone_number
        , first_name
        , last_name
        , server_ts
    FROM
    (
        SELECT
            user_id
            , address
            , age
            , phone_number
            , first_name
            , last_name
            , server_ts
            , row_number() over
              (partition by user_id, address, phone_number, first_name, last_name
               order by user_id, address, phone_number, first_name, last_name, server_ts desc, age) AS rn
        FROM
        (
            SELECT
                user_id
                , address
                , age
                , phone_number
                , first_name
                , last_name
                , server_ts
            FROM
                table
            WHERE
                phone_number <> '911' AND
                server_date >= '2015-12-01' AND server_date < '2016-01-01' AND
                user_id IS NOT NULL AND
                first_name IS NOT NULL AND
                last_name IS NOT NULL AND
                address IS NOT NULL AND
                phone_number IS NOT NULL
        ) all_rows
    ) all_rows_with_row_number
    WHERE rn = 1
  """)

input.registerTempTable("input_tbl")
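As an aside, a quick way to check how many partitions (and therefore how many parallel tasks) this DataFrame has, and to force more downstream parallelism if needed. A minimal sketch; the partition count of 200 is just an illustrative value:

// Each partition becomes one task; this shows how many tasks later steps can use.
println(s"input partitions: ${input.rdd.partitions.length}")

// Optionally repartition (triggers a shuffle) if downstream parallelism is too low.
val inputRepartitioned = input.repartition(200)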

Step 2:

val result = hiveContext.sql("""
    SELECT state,
           phone_number,
           address,
           COUNT(*) AS hash_count,
           MAX(server_ts) AS latest_ts
    FROM
    (   SELECT
            udf_getState(address) AS state
            , user_id
            , address
            , age
            , phone_number
            , first_name
            , last_name
            , server_ts
        FROM
            input_tbl ) input
    WHERE state IS NOT NULL AND state != ''
    GROUP BY state, phone_number, address
  """)
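Note that udf_getState must be registered on the HiveContext before this query runs. A hypothetical sketch; the state-extraction logic is only a placeholder, since the real UDF is not shown in the question:

// Hypothetical registration of the UDF referenced in the query above;
// the parsing logic is a placeholder, not the actual implementation.
hiveContext.udf.register("udf_getState", (address: String) => {
  Option(address).map(_.split(",").last.trim.take(2).toUpperCase).orNull
})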

Step 3:

result.cache()
result.map(x => ...).saveAsNewAPIHadoopDataset(conf)
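For reference, the conf and map in step 3 typically pair each row with an HBase Put and write through TableOutputFormat. A rough sketch only; the table name, column family, row-key layout, and column types below are placeholders, not the question's actual mapping:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// Placeholder table name and output format wiring.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "result_table")
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

result.cache()
result.map { row =>
  // Row key and columns are illustrative; adapt to the real schema.
  val rowKey = Bytes.toBytes(row.getString(0) + "|" + row.getString(1))   // state|phone_number
  val put = new Put(rowKey)
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("hash_count"), Bytes.toBytes(row.getLong(3)))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("latest_ts"), Bytes.toBytes(row.get(4).toString))
  (new ImmutableBytesWritable(rowKey), put)
}.saveAsNewAPIHadoopDataset(job.getConfiguration)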

The DAG Visualization looks like this: [DAG visualization screenshot]

As you can see, the "Filter", "Project" and "Exchange" in stage 0 appear to run in only one instance, as do stage 1 and stage 2, so a few questions (and apologies if they are dumb):

  1. Does "Filter", "Project" and "Exchange" happen in Driver after data shuffling from each executor?
  2. What code maps to "Filter", "Project" and "Exchange"?
  3. how I could run "Filter", "Project" and "Exchange" in parallel to optimize the performance?
  4. is it possible to run stage1 and stage2 in parallel?
asked Apr 27 '16 by user_not_found


2 Answers

You're not reading the DAG graph correctly - the fact that each step is visualized using a single box does not mean that it isn't using multiple tasks (and therefore cores) to calculate that step.

You can see how many tasks are used for each step by drilling down into the stage view, which displays all tasks for that stage.

For example, here's a sample DAG visualization similar to yours:

[screenshot: sample DAG visualization]

You can see each stage is depicted by a "single" column of steps.

But if we look at the table below, we can see the number of tasks per stage:

[screenshot: stage summary table showing tasks per stage]

One of them uses only 2 tasks, while the other uses 220, which means the data is split into 220 partitions that are processed in parallel, given enough available resources.
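As a side note, the number of partitions an Exchange produces in Spark SQL is controlled by spark.sql.shuffle.partitions (200 by default), so it can be raised if more parallelism is needed. A minimal sketch, assuming the hiveContext from the question:

// Raise the number of partitions used after shuffles (Exchange) in Spark SQL queries.
// Default is 200; a higher value can increase parallelism for large aggregations.
hiveContext.setConf("spark.sql.shuffle.partitions", "400")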

If you drill down into that stage, you can see again that it used 220 tasks, along with details for all of them.

[screenshot: stage detail view listing the 220 tasks]

Only tasks reading data from disk are shown in the graph with these "multiple dots", to help you understand how many files were read.

SO - as Rashid's answer suggests, check the number of tasks for each stage.

answered Sep 23 '22 by Tzach Zohar


It is not obvious, so I would do the following things to zero in on the problem.

  1. Calculate the execution time of each step (see the timing sketch after this list).
  2. The first step may be slow if your table is in text format; Spark usually works better if the data is stored in Hive in Parquet format.
  3. See if your table is partitioned by the column used in the WHERE clause.
  4. If saving data to HBase is slow, you may need to pre-split the HBase table, since by default data is stored in a single region.
  5. Look at the Stages tab in the Spark UI to see how many tasks are started for each stage, and also look at the data locality level as described here.
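A rough way to time each step (item 1) is to force an action on each DataFrame. A minimal sketch, assuming the input and result DataFrames from the question:

// Simple timing helper: runs a block and prints how long it took.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val out = block
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  out
}

// count() forces each query to actually execute.
time("step 1 (dedup)")       { input.count() }
time("step 2 (aggregation)") { result.count() }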

Hopefully, you will be able to zero in on the problem.

answered Sep 23 '22 by Rashid Ali