
spark-redshift takes a lot of time to write to redshift

I am setting up Spark Streaming with Kinesis and Redshift. I read data from Kinesis every 10 seconds, process it, and write it to Redshift using the spark-redshift library.

The problem is that it takes a very long time to write only 300 rows.

This is what it shows me in the console

[Stage 56:====================================================> (193 + 1) / 200]

Looking at my logs, the df.write.format call is what produces this stage.

I have Spark set up on an Amazon EC2 machine with 4 GB of RAM and 2 cores, running in --master local[*] mode.

Here is how I create the stream:

CHECKPOINT_INTERVAL = 60
STORAGE_LEVEL = StorageLevel.MEMORY_ONLY   # "memory"
kinesisStream = KinesisUtils.createStream(
    ssc, APPLICATION_NAME, STREAM_NAME, ENDPOINT, REGION_NAME, INITIAL_POS,
    CHECKPOINT_INTERVAL, awsAccessKeyId=AWSACCESSID, awsSecretKey=AWSSECRETKEY,
    storageLevel=STORAGE_LEVEL)

kinesisStream.foreachRDD(WriteToTable)

def WriteToTable(df, table):
    if table in REDSHIFT_PAGEVIEW_TBL:
        # Aggregate page views for this batch
        df = df.groupBy([COL_STARTTIME, COL_ENDTIME, COL_CUSTOMERID, COL_PROJECTID,
                         COL_FONTTYPE, COL_DOMAINNAME, COL_USERAGENT]).count()
        df = df.withColumnRenamed('count', COL_PAGEVIEWCOUNT)

        # Write back to the Redshift table
        url = ("jdbc:redshift://" + REDSHIFT_HOSTNAME + ":" + REDSHIFT_PORT + "/" +
               REDSHIFT_DATABASE + "?user=" + REDSHIFT_USERNAME +
               "&password=" + REDSHIFT_PASSWORD)

        s3Dir = 's3n://' + AWSACCESSID + ':' + AWSSECRETKEY + '@' + BUCKET + '/' + FOLDER

        print 'Start writing to redshift'
        df.write.format("com.databricks.spark.redshift") \
            .option("url", url) \
            .option("dbtable", REDSHIFT_PAGEVIEW_TBL) \
            .option('tempdir', s3Dir) \
            .mode('append') \
            .save()
        print 'Finished writing to redshift'

Please let me know why it is taking this much time.

Asked Mar 02 '16 by Nipun




1 Answer

I have had similar experiences when writing to Redshift, both through Spark and directly. spark-redshift always writes the data to S3 and then uses the Redshift COPY command to load it into the target table. This is the best practice and the most efficient way to write large numbers of records, but it also imposes a fixed overhead on every write, which is particularly noticeable when the number of records per write is relatively small.
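As a rough illustration, you can time a single micro-batch write (just a sketch reusing df, url, s3Dir and REDSHIFT_PAGEVIEW_TBL from your code); for a few hundred rows, almost all of the elapsed time is the S3 upload plus the COPY round trip rather than the data volume:

import time

# Time one micro-batch write; the fixed S3 + COPY overhead dominates for small batches
start = time.time()
df.write.format("com.databricks.spark.redshift") \
    .option("url", url) \
    .option("dbtable", REDSHIFT_PAGEVIEW_TBL) \
    .option('tempdir', s3Dir) \
    .mode('append') \
    .save()
print 'Redshift write took %.1f seconds' % (time.time() - start)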

Looking at the output above, your job is running with 200 partitions, which is the default value of the spark.sql.shuffle.partitions setting. You can find more details in the Spark documentation.

The groupBy operation is probably generating those 200 partitions, which means you are doing 200 separate copy operations to S3, each with the substantial latency of getting a connection and completing the write.
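You can confirm this just before the write (a quick sketch using the grouped df from the code above); each partition becomes a separate file uploaded to the S3 tempdir:

# How many partitions will be written out for this micro-batch?
num_parts = df.rdd.getNumPartitions()
print 'grouped DataFrame has %d partitions' % num_parts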

As we discussed in the comments and in chat, you can coalesce the result of the groupBy into fewer partitions by making the following change to the line above:

df = df.coalesce(4).withColumnRenamed('count', COL_PAGEVIEWCOUNT)

This will reduce the number of partitions from 200 to 4 and cut the overhead of the copies to S3 by a couple of orders of magnitude. You can experiment with the number of partitions to optimize performance. You could also lower the spark.sql.shuffle.partitions setting itself, given the size of the data you are dealing with and the number of available cores.
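For example (a sketch; it assumes your SQLContext is called sqlContext, and 4 is just an illustrative value to tune):

# Lower the shuffle partition count for this job so aggregations such as the
# groupBy above produce 4 partitions instead of the default 200
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
# The same setting can be passed at submit time with:
#   --conf spark.sql.shuffle.partitions=4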

Answered Oct 12 '22 by DemetriKots