I have a DynamoDB table that I need to query from Spark SQL on EMR. My EMR cluster uses release label emr-4.6.0 with Spark 1.6.1.
I am referring to the document: Analyse DynamoDB Data with Spark
After connecting to the master node, I run the command:
spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
It gives a warning:
Warning: Local jar /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar does not exist, skipping.
Later, when I import the DynamoDB input and output formats using
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
It gives the error:
error: object dynamodb is not a member of package org.apache.hadoop
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
error: object dynamodb is not a member of package org.apache.hadoop
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
I think it is the jar that is causing this error. Where do I get this emr-ddb-hadoop.jar?
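Before anything else, it is worth checking whether the connector jar actually exists anywhere on the master node, since the path can differ between EMR releases. A quick sketch (run over SSH on the master node; the filename pattern is an assumption):

```shell
# Check the documented location first
ls -l /usr/share/aws/emr/ddb/lib/

# If it is not there, search the filesystem for the connector jar
# (the exact filename may vary by EMR release)
sudo find / -name "emr-ddb-hadoop*.jar" 2>/dev/null
```

If the jar turns up under a different path, pass that path to `--jars` (or to the classpath settings below) instead.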
Don't use spark-shell --jars; instead, add the jar to the classpath in spark-defaults.conf:
spark.driver.extraClassPath /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
spark.executor.extraClassPath /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
After that, importing the DynamoDB input and output formats works:
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
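Once the imports resolve, a minimal sketch of reading the table into an RDD from the spark-shell, following the pattern in the AWS EMR DynamoDB connector docs (the table name and region here are placeholders):

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat

// JobConf carrying the connector settings; "MyTable" and the
// us-east-1 region are placeholders for your own table and region
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.input.tableName", "MyTable")
jobConf.set("dynamodb.servicename", "dynamodb")
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
jobConf.set("dynamodb.regionid", "us-east-1")

// Each record arrives as a DynamoDBItemWritable keyed by Text
val rows = sc.hadoopRDD(jobConf,
  classOf[DynamoDBInputFormat],
  classOf[Text],
  classOf[DynamoDBItemWritable])

rows.count()
```

From there you can map the `DynamoDBItemWritable` values into case classes and register them as a DataFrame for Spark SQL queries.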
The root cause of this problem is that emr-ddb-hadoop.jar is not available in the environment (at least not at the specified location). In order to install the DynamoDB libraries, you have to select Hadoop 2.7.2 along with your applications of interest when you create the Spark EMR cluster. Did you select that?
If not, launch a new cluster: go to the advanced options and make sure Hadoop 2.7.2 is selected along with the other applications.
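For reference, a hedged sketch of launching such a cluster from the AWS CLI rather than the console (the cluster name, key pair, and instance settings are placeholders); selecting both Hadoop and Spark as applications ensures the DynamoDB connector libraries get installed:

```shell
aws emr create-cluster \
  --name "spark-ddb-cluster" \
  --release-label emr-4.6.0 \
  --applications Name=Hadoop Name=Spark \
  --ec2-attributes KeyName=my-key-pair \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles
```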