How to tune a Spark job on EMR to write huge data quickly to S3

I have a Spark job that does an outer join between two data frames. The first data frame is 260 GB of text files, split into 2,200 files; the second is 2 GB. Writing the joined output, which is about 260 GB, to S3 takes a very long time. I cancelled the job after more than 2 hours because I was being charged heavily for the EMR cluster.

Here is my cluster info:

emr-5.9.0
Master:    m3.2xlarge
Core:      r4.16xlarge   10 machines (each machine has 64 vCore, 488 GiB memory,EBS Storage:100 GiB)

This is the cluster configuration that I am setting:

classification      property                                      value
capacity-scheduler  yarn.scheduler.capacity.resource-calculator   org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
emrfs-site          fs.s3.maxConnections                          200
spark               maximizeResourceAllocation                    true
spark-defaults      spark.dynamicAllocation.enabled               true

I also tried setting the memory components manually, as shown below. Performance was better, but the job still took a very long time.

--num-executors 60 --conf spark.yarn.executor.memoryOverhead=9216 --executor-memory 72G --conf spark.yarn.driver.memoryOverhead=3072 --driver-memory 26G --executor-cores 10 --driver-cores 3 --conf spark.default.parallelism=1200
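
As a rough sanity check on these flags, the arithmetic below compares what they request with the cluster listed above (10 core nodes, each with 64 vCores and 488 GiB). It is only a sketch: the object and value names are made up for illustration, and it assumes the 60 executors are spread evenly over the 10 nodes.

```scala
// Back-of-the-envelope check of the spark-submit flags against the cluster
// described in the question. All names are illustrative, not Spark APIs.
object ExecutorSizing {
  // Cluster figures copied from the question.
  val coreNodes = 10
  val vCoresPerNode = 64
  val memGiBPerNode = 488

  // Values taken from the spark-submit flags.
  val numExecutors = 60
  val executorCores = 10
  val executorMemGiB = 72
  val overheadGiB = 9216 / 1024 // memoryOverhead is given in MiB

  // Assumption: executors are spread evenly across the core nodes.
  val executorsPerNode = numExecutors / coreNodes
  val coresUsedPerNode = executorsPerNode * executorCores
  val containerGiB = executorMemGiB + overheadGiB
  val memUsedPerNodeGiB = executorsPerNode * containerGiB

  def main(args: Array[String]): Unit = {
    println(s"cores per node: $coresUsedPerNode of $vCoresPerNode")          // 60 of 64
    println(s"memory per node: $memUsedPerNodeGiB GiB of $memGiBPerNode GiB") // 486 of 488
  }
}
```

Under that assumption each node would need 486 GiB for executor containers alone. YARN normally exposes less memory than the instance total, so it may not be able to place all 60 executors at this size.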

I am not using the default partitioning when saving the data to S3.

I am adding all the details about the job and the query plan so that it is easier to understand.

The real bottleneck is the partitioning, which takes most of the time. I have about 2K input files, so if I repartition to something like 200, the number of output files runs into lakhs (hundreds of thousands), and loading those back into Spark is not practical.
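
The file explosion described here follows from how a partitioned write behaves: each write task emits one file for every partition directory it holds rows for. A minimal sketch of the resulting upper bound is below. The counts are hypothetical, since the question does not say how many distinct partition directories exist, and the function names are made up for illustration.

```scala
// Upper bounds on the number of output files of a partitioned write.
// The numbers in main are hypothetical, chosen only to show the scale.
object OutputFileBound {
  // Worst case: every one of `writeTasks` tasks holds rows for every
  // directory, so each directory receives one file per task.
  def scattered(writeTasks: Int, partitionDirs: Int): Long =
    writeTasks.toLong * partitionDirs

  // If rows are shuffled first so that each directory is fed by at most
  // `tasksPerDir` tasks, the bound drops accordingly.
  def clustered(tasksPerDir: Int, partitionDirs: Int): Long =
    tasksPerDir.toLong * partitionDirs

  def main(args: Array[String]): Unit = {
    val dirs = 1000 // hypothetical number of distinct partition directories
    println(scattered(writeTasks = 200, partitionDirs = dirs)) // 200000
    println(clustered(tasksPerDir = 4, partitionDirs = dirs))  // 4000
  }
}
```

With these made-up numbers, 200 scattered tasks already allow up to 2 lakh files, which matches the order of magnitude reported above.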

In the image below, I don't know why a sort is called again after the project step. [image]

In the image below, the GC time looks too high to me. Do I have to handle this? If so, please suggest how. [image: Executor and GC details]

Below is the node health status. At this point the data is being saved to S3, and I wonder why only two nodes are active while all the others are idle. [image: node details while data is being saved to S3]

These are the cluster details while the data is loading. At this point the cluster is fully utilized, but while the data is being saved to S3 many nodes are free. [image: fully utilized cluster]

Finally, here is my code, where I perform the join and then save the result to S3:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
// The $"..." column syntax also needs `import spark.implicits._`,
// where `spark` is the active SparkSession.

// Key columns shared by the window and the join.
val keyColumns = Seq("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId", "FinancialStatementLineItem_lineItemId")
val partitionColumns = Seq("DataPartition", "PartitionYear", "PartitionStatement")

// Keep only the latest row per key, ordered by TimeStamp descending.
val windowSpec = Window
  .partitionBy(keyColumns.map(c => col(c)): _*)
  .orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd HH:mm:ss.SSS").cast("timestamp").desc)
val latestForEachKey = df2resultTimestamp
  .withColumn("rank", row_number().over(windowSpec))
  .filter($"rank" === 1)
  .drop("rank", "TimeStamp")

// Pair each column ending in "_1" with the column of the same name without the suffix.
val columnMap = latestForEachKey.columns
  .filter(c => c.endsWith("_1") && c != "FFAction|!|_1")
  .map(c => c -> c.dropRight(2)) :+ ("FFAction|!|_1", "FFAction|!|")
// Prefer the "_1" value and fall back to the unsuffixed column.
val exprs = columnMap.map(t => coalesce(col(t._1), col(t._2)).as(t._2))
val exprsExtended = keyColumns.map(c => col(c)) ++ exprs

// Join both data frames here
val dfMainOutput = dataMain
  .join(latestForEachKey, keyColumns, "outer")
  .select(exprsExtended: _*)
  .filter(!$"FFAction|!|".contains("D|!|"))
// Join ends here

// Concatenate every non-partition column into one "|^|"-delimited string column.
val dfMainOutputFinal = dfMainOutput.na.fill("").select(
  $"DataPartition", $"PartitionYear", $"PartitionStatement",
  concat_ws("|^|", dfMainOutput.schema.fieldNames.filterNot(c => partitionColumns.contains(c)).map(c => col(c)): _*).as("concatenated"))

val headerColumn = dataHeader.columns.toSeq

val headerFinal = headerColumn.mkString("", "|^|", "|!|").dropRight(3)

// "|" and "^" are regex metacharacters, so they are escaped to match the literal text "|^|null".
val dfMainOutputFinalWithoutNull = dfMainOutputFinal
  .withColumn("concatenated", regexp_replace(col("concatenated"), "\\|\\^\\|null", ""))
  .withColumnRenamed("concatenated", headerFinal)

dfMainOutputFinalWithoutNull
  // .repartition($"DataPartition", $"PartitionYear", $"PartitionStatement")
  .write
  .partitionBy("DataPartition", "PartitionYear", "PartitionStatement")
  .format("csv")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .option("nullValue", "")
  .option("delimiter", "\t")
  .option("quote", "\u0000")
  .option("header", "true")
  .option("codec", "bzip2")
  .save(outputFileURL)
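
One commonly used way to keep the file count bounded without funnelling a large partition directory through a single task is to repartition on the partition columns plus a random salt. The sketch below is not taken from the post: `SaltedPartitionedWrite`, `filesPerDir` and the `salt` column are hypothetical names, only a subset of the CSV options above is repeated, and whether it helps depends on how skewed the three partition columns are.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, floor, rand}

// Sketch only: object, parameter and column names are illustrative.
object SaltedPartitionedWrite {
  def write(df: DataFrame, outputUrl: String, filesPerDir: Int): Unit = {
    val partitionCols = Seq("DataPartition", "PartitionYear", "PartitionStatement")

    df
      // Random salt in [0, filesPerDir): rows of one directory are spread
      // over at most `filesPerDir` shuffle keys, hence at most that many tasks.
      .withColumn("salt", floor(rand() * filesPerDir))
      .repartition((partitionCols :+ "salt").map(c => col(c)): _*)
      .drop("salt")
      .write
      .partitionBy(partitionCols: _*)
      .format("csv")
      .option("delimiter", "\t")
      .option("header", "true")
      .option("codec", "bzip2")
      .save(outputUrl)
  }
}
```

A call such as `SaltedPartitionedWrite.write(dfMainOutputFinalWithoutNull, outputFileURL, filesPerDir = 8)` would cap each partition directory at 8 files; the value 8 is arbitrary. The shuffle behind `repartition` on columns uses `spark.sql.shuffle.partitions` tasks (200 by default), which is a separate setting from `spark.default.parallelism`.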


asked Oct 15 '17 11:10

SUDARSHAN

1 Answer

You are running five c3.4xlarge EC2 instances, each of which has 30 GB of RAM. That is only 150 GB in total, which is much smaller than the >200 GB data frame you are joining, hence lots of disk spill. Maybe you can launch r-type EC2 instances (memory optimized, as opposed to c-type, which is compute optimized) instead and see whether performance improves.


answered Sep 22 '22 05:09

Fermat's Little Student

Tags: apache-spark-sql, hadoop2, spark-dataframe, amazon-emr
