I'm trying to connect to Amazon Redshift via Spark so that I can join data we have on S3 with data on our Redshift cluster. I found some very spartan documentation for Spark's JDBC data source here:
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases
The load command seems fairly straightforward (although I don't know how I would enter AWS credentials here, maybe in the options?).
df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
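To make my guess concrete, I've been imagining something like one of these (host, port, and names are placeholders; I have no idea whether the JDBC source actually reads user/password options):

# Guess 1: credentials passed as extra options
df = sqlContext.load(
    source="jdbc",
    url="jdbc:postgresql://host:5439/dbname",
    dbtable="schema.tablename",
    user="my_user",
    password="my_password",
)

# Guess 2: credentials embedded in the URL, PostgreSQL-style
df = sqlContext.load(
    source="jdbc",
    url="jdbc:postgresql://host:5439/dbname?user=my_user&password=my_password",
    dbtable="schema.tablename",
)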
And I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an IPython notebook (as part of the Spark distribution). Where do I define that variable so that Spark picks it up?
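For what it's worth, my blind guess so far has been to export it before launching the notebook (the jar path is a placeholder; I don't know if this is even the right place to set it):

# exported in the shell (or in conf/spark-env.sh) before launching the notebook
export SPARK_CLASSPATH=/path/to/postgresql-9.4-1201.jdbc4.jar
IPYTHON_OPTS="notebook" ./bin/pyspark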
Anyway, when I try running these commands I get a bunch of undecipherable errors, so I'm kind of stuck. Any help or pointers to detailed tutorials would be appreciated.
Although this is a very old post, for anyone still looking for an answer, the steps below worked for me!
Start the shell with the JDBC driver jar included:
bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar
Create a DataFrame, filling in the appropriate connection details:
myDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://host:port/db_name") \
    .option("dbtable", "table_name") \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()
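As a side note, if you only need a subset of the table, the same reader accepts a parenthesized subquery (with an alias) in place of the table name, so the query runs on Redshift instead of pulling everything over. A sketch with made-up column names:

# push a query down to Redshift; the subquery must be aliased (here as "tmp")
subsetDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://host:port/db_name") \
    .option("dbtable", "(SELECT col_a, col_b FROM table_name LIMIT 10) tmp") \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()

subsetDF.show()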
Spark Version: 2.2
It turns out you only need a username and password to access Redshift from Spark, and it is done as follows (using the Python API):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.load(
    format="jdbc",
    url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret",
    dbtable="schema.table",
)
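If you'd rather keep the credentials out of the URL, they can usually be passed as separate options instead (this assumes the driver honors the standard user/password connection properties):

# equivalent, with credentials kept out of the connection URL
df = sqlContext.read.load(
    format="jdbc",
    url="jdbc:postgresql://host:port/dbserver",
    dbtable="schema.table",
    user="yourusername",
    password="secret",
)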
Hope this helps someone!