We have huge amounts of server data stored in S3 (soon to be in Parquet format). The data needs some transformation, so it can't be a straight copy from S3. I'll be using Spark to access the data, but I'm wondering: instead of manipulating it with Spark, writing back out to S3, and then copying to Redshift, can I skip a step and run a query that pulls/transforms the data and copies it straight to Redshift?
COPY command – Amazon Redshift recently added support for loading Parquet files with its bulk-load COPY command.
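For reference, a minimal sketch of that path issued over JDBC from Scala; the table name, S3 prefix, and IAM role below are placeholders, and the Redshift (or PostgreSQL) JDBC driver is assumed to be on the classpath:
import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
try {
  // COPY reads the Parquet files directly from S3 into the target table.
  conn.createStatement().execute(
    """COPY my_table
      |FROM 's3://my-bucket/path/to/parquet/'
      |IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
      |FORMAT AS PARQUET""".stripMargin)
} finally {
  conn.close()
}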
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
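As a quick illustration (the S3 path is a placeholder), the schema stored with the Parquet files comes back when you read them, with every column reported as nullable:
val df = sqlContext.read.parquet("s3n://my-bucket/server-data/")
df.printSchema() // columns show up as e.g. "host: string (nullable = true)"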
Spark on Qubole supports the Spark Redshift connector, which is a library that lets you load data from Amazon Redshift tables into Spark SQL DataFrames, and write data back to Redshift tables.
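A minimal sketch of the read direction with that connector, assuming the same placeholder endpoint, table, and staging bucket as in the write snippet further down:
val redshiftDF = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://path/for/temp/data") // S3 staging area used by the connector
  .load()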
Sure thing, totally possible.
Scala code to write and read Parquet (taken from here)
import org.apache.spark.rdd.RDD
import sqlContext.implicits._ // enables the RDD-to-DataFrame conversion used below

val people: RDD[Person] = ... // an existing RDD of case class objects
people.toDF().write.parquet("people.parquet") // the schema is stored in the Parquet files
val parquetFile = sqlContext.read.parquet("people.parquet") // read back as a DataFrame
Scala code to write to Redshift (taken from here)
parquetFile.write
  .format("com.databricks.spark.redshift")
  // JDBC endpoint of the Redshift cluster
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  // target table in Redshift
  .option("dbtable", "my_table_copy")
  // S3 directory the connector uses to stage data before issuing COPY
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error") // fail if the table already exists
  .save()
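Putting the pieces together for the original question, a rough end-to-end sketch: read the Parquet data from S3, apply the transformation in Spark, and hand the result to the connector, which stages it under tempdir and issues a COPY behind the scenes, so you never write the transformed dataset back to S3 yourself. Bucket, table, and column names here are placeholders:
import sqlContext.implicits._

// Read the raw Parquet data straight from S3.
val serverData = sqlContext.read.parquet("s3n://my-bucket/server-data/")

// Example transformation; replace with whatever reshaping you actually need.
val transformed = serverData
  .filter($"status" === "active")
  .select($"host", $"timestamp", $"bytes")

// Write the transformed DataFrame directly to Redshift.
transformed.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "server_data")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error")
  .save()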