
AWS EMR and Spark 1.0.0

I've been running into some issues recently while trying to use Spark on an AWS EMR cluster.

I am creating the cluster using something like:

./elastic-mapreduce --create --alive \
--name "ll_Spark_Cluster" \
--bootstrap-action s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb \
--bootstrap-name "Spark/Shark" \
--instance-type m1.xlarge \
--instance-count 2 \
--ami-version 3.0.4

The issue is that whenever I try to get data from S3, I get an exception. So if I start the spark-shell and try something like:

val data = sc.textFile("s3n://your_s3_data")

I get the following exception:

WARN storage.BlockManager: Putting block broadcast_1 failed
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
asked Aug 21 '14 by Eras




1 Answer

This issue is caused by the Guava library: the version on the AMI is 11, while Spark needs version 14. Spark calls com.google.common.hash.HashFunction.hashInt, which does not exist in Guava 11, hence the NoSuchMethodError above.
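
You can confirm which Guava version the AMI ships by listing the jars on the master node; the /home/hadoop/lib path is an assumption on my part, based on where the answer below finds the native libraries:

# On the master node, look for the bundled Guava jar (path is an assumption)
ls /home/hadoop/lib | grep -i guava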

I edited the AWS bootstrap script to install Spark 1.0.2 and update the Guava library during the bootstrap action. You can get the gist here:

https://gist.github.com/tnbredillet/867111b8e1e600fa588e
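
The core of the Guava swap might look something like this in a bootstrap action; the paths, version, and Maven URL here are guesses on my part, and the linked gist is the authoritative version:

#!/bin/bash
# Hypothetical sketch of the Guava upgrade step (paths and version are
# assumptions; see the linked gist for the actual script).
cd /home/hadoop/lib
# Drop the Guava 11 jar that ships with the AMI...
rm -f guava-11*.jar
# ...and pull in a Guava 14 jar so Spark's hashInt call resolves.
wget https://repo1.maven.org/maven2/com/google/guava/guava/14.0.1/guava-14.0.1.jar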

Even after updating Guava I still had an issue. When I tried to save data to S3, the following exception was thrown:

lzo.GPLNativeCodeLoader - Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path

I solved that by adding the Hadoop native library to java.library.path. When I run a job, I add the option

 -Djava.library.path=/home/hadoop/lib/native 

or, if I run a job through spark-submit, I add the

--driver-library-path /home/hadoop/lib/native 

argument.
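
Put together, the two invocations look something like this; the install path, class name, and jar are placeholders I'm assuming, not from my actual setup:

# Spark shell: pass the driver JVM option via SPARK_JAVA_OPTS,
# which Spark 1.0.x still honors (install path is an assumption):
SPARK_JAVA_OPTS="-Djava.library.path=/home/hadoop/lib/native" \
  /home/hadoop/spark/bin/spark-shell

# spark-submit: --driver-library-path sets the same path for the driver;
# com.example.MyJob and my-job.jar are hypothetical placeholders:
/home/hadoop/spark/bin/spark-submit \
  --driver-library-path /home/hadoop/lib/native \
  --class com.example.MyJob \
  my-job.jar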

answered Sep 26 '22 by Eras