Can you copy straight from Parquet/S3 to Redshift using Spark SQL/Hive/Presto?

We have huge amounts of server data stored in S3 (soon to be in Parquet format). The data needs some transformation, so it can't be a straight copy from S3. I'll be using Spark to access the data, but I'm wondering whether, instead of manipulating it with Spark, writing it back out to S3, and then copying it to Redshift, I can skip a step and run a query that pulls/transforms the data and copies it straight to Redshift?

asked Apr 14 '16 by flybonzai

People also ask

Can we load Parquet files in Redshift?

Yes. Amazon Redshift recently added support for Parquet files in its bulk-load COPY command.
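
For reference, here is a minimal sketch of issuing such a COPY from Scala over JDBC. The connection string, table name, S3 path, and IAM role are all placeholders, and the Redshift JDBC driver is assumed to be on the classpath:

import java.sql.DriverManager

// Placeholders: host, credentials, table, S3 path, and IAM role are assumptions.
val conn = DriverManager.getConnection(
  "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
val stmt = conn.createStatement()
stmt.execute(
  """COPY my_table
    |FROM 's3://my-bucket/path/to/parquet/'
    |IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    |FORMAT AS PARQUET""".stripMargin)
stmt.close()
conn.close()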

Can you use Spark SQL to read Parquet data?

Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
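
As a quick illustration, a sketch of reading Parquet directly from S3 with the SparkSession API. The bucket path is a placeholder, and the S3 filesystem connector (e.g. hadoop-aws) is assumed to be configured:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-read-example").getOrCreate()

// Path is a placeholder; the schema is inferred from the Parquet metadata.
val df = spark.read.parquet("s3a://my-bucket/path/to/parquet/")
df.printSchema()   // columns come back as nullable, as noted above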

Can Spark write to Redshift?

Spark on Qubole supports the Spark Redshift connector, which is a library that lets you load data from Amazon Redshift tables into Spark SQL DataFrames, and write data back to Redshift tables.
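
A sketch of the read direction with that connector (host, credentials, table, and temp directory are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("redshift-read-example").getOrCreate()

// Load a Redshift table into a DataFrame; the connector unloads to tempdir on S3 behind the scenes.
val redshiftDF = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://path/for/temp/data")
  .load()
redshiftDF.show()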


1 Answer

Sure thing, totally possible.

Scala code to read parquet (taken from here)

// An RDD of case class objects; toDF() (available via import sqlContext.implicits._) turns it into a DataFrame
val people: RDD[Person] = ...
people.toDF().write.parquet("people.parquet")               // write out as Parquet
val parquetFile = sqlContext.read.parquet("people.parquet") // read back as a DataFrame

Scala code to write to redshift (taken from here)

parquetFile.write
  .format("com.databricks.spark.redshift")                  // spark-redshift connector
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")                       // target table in Redshift
  .option("tempdir", "s3n://path/for/temp/data")            // S3 staging area the connector uses for COPY
  .mode("error")                                            // fail if the table already exists
  .save()
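
Putting the two snippets together, one possible end-to-end shape for the question (read Parquet from S3, transform in Spark, write straight to Redshift). The paths, table name, and transformation are placeholders, and sqlContext is the same one used above:

// Read the raw Parquet data from S3 (path and column name are placeholders).
val raw = sqlContext.read.parquet("s3n://my-bucket/server-data/")

// Placeholder transformation: aggregate hits per host with the DataFrame API.
val transformed = raw.groupBy("host").count()

// Write the result straight to Redshift; the connector stages data in tempdir and issues COPY for you.
transformed.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "server_data_summary")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error")
  .save()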
answered Sep 30 '22 by evgenii