
How to use TwitterUtils in Spark shell?

Tags:

apache-spark

I'm trying to use TwitterUtils in the Spark shell (where it is not available by default).

I've added the following to spark-env.sh:

SPARK_CLASSPATH="/disk.b/spark-master-2014-07-28/external/twitter/target/spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar"

I can now execute

import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._

without an error in the shell, which would not be possible without adding the JAR to the classpath ("error: object twitter is not a member of package org.apache.spark.streaming"). However, I get an error when executing this in the Spark shell:

scala> val ssc = new StreamingContext(sc, Seconds(1))
ssc: org.apache.spark.streaming.StreamingContext =
org.apache.spark.streaming.StreamingContext@6e78177b

scala> val tweets = TwitterUtils.createStream(ssc, "twitter.txt")
error: bad symbolic reference. A signature in TwitterUtils.class refers to
term twitter4j in package <root> which is not available.
It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling
TwitterUtils.class.

What am I missing? Do I have to import another jar?

helm asked Aug 01 '14 16:08



1 Answer

Yep, you need the Twitter4J JARs in addition to the spark-streaming-twitter one you already have. Specifically, the Spark devs suggest using Twitter4J version 3.0.3.
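Before launching the shell, it can help to confirm that the Twitter4J JARs are actually where you expect them. Here is a minimal sketch of such a check. The function name check_twitter4j is made up for illustration, and the two artifact names it looks for (twitter4j-core and twitter4j-stream) are my assumption about what TwitterUtils needs, so adjust the list if your build differs.

```bash
#!/bin/bash
# check_twitter4j DIR: report which expected Twitter4J JARs are absent from DIR.
# Hypothetical helper; the artifact names are an assumption, not taken from Spark's docs.
check_twitter4j() {
  local dir="$1" name jar found missing=0
  for name in twitter4j-core twitter4j-stream; do
    found=0
    for jar in "$dir"/"$name"-*.jar; do
      # An unmatched glob stays literal, so test that the file really exists.
      [ -e "$jar" ] && found=1
    done
    if [ "$found" -eq 0 ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_twitter4j on the directory holding your JARs prints one "missing: ..." line per absent artifact and returns non-zero if anything is absent, so it can guard the spark-shell launch.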

After you download the correct JARs, you'll want to pass them to the shell via the --jars flag. I think you can also do this via SPARK_CLASSPATH as you've done.
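The --jars flag takes a single comma-separated list, so the JAR paths have to be joined before they are passed in. A minimal sketch of one way to build that list from a directory follows. The function name join_jars and the example path are hypothetical, not part of Spark.

```bash
#!/bin/bash
# join_jars DIR: print every *.jar in DIR as one comma-separated line.
# Hypothetical helper for illustration only.
join_jars() {
  local dir="$1" out="" jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue      # skip the literal pattern when nothing matches
    out="${out:+$out,}$jar"        # add a comma only between entries
  done
  printf '%s\n' "$out"
}

# Example usage (the directory is an assumption):
# spark-shell --jars "$(join_jars /path/to/twitter4j)"
```

Building the list with a glob loop avoids parsing ls output, and quoting the result keeps the whole list as one argument even if a path contains spaces.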

Here's how I did it on a Spark EC2 cluster:

#!/bin/bash
cd /root/spark/lib
mkdir twitter4j

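# Get the Spark Streaming Twitter JAR.
# Name the output file explicitly, since the real path sits in the URL's query string.
curl -L -o spark-streaming-twitter_2.10-1.0.0.jar "http://search.maven.org/remotecontent?filepath=org/apache/spark/spark-streaming-twitter_2.10/1.0.0/spark-streaming-twitter_2.10-1.0.0.jar"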

# Get the Twitter4J JARs. Check out http://twitter4j.org/archive/ for other versions.
TWITTER4J_SOURCE=twitter4j-3.0.3.zip
curl -O "http://twitter4j.org/archive/$TWITTER4J_SOURCE"
unzip -j ./$TWITTER4J_SOURCE "lib/*.jar" -d twitter4j/
rm $TWITTER4J_SOURCE

cd
# Point the shell to these JARs and go!
# Join the JAR paths with commas (ls -m would insert ", " between names that share a line).
TWITTER4J_JARS=$(ls /root/spark/lib/twitter4j/*.jar | paste -sd, -)
/root/spark/bin/spark-shell --jars /root/spark/lib/spark-streaming-twitter_2.10-1.0.0.jar,$TWITTER4J_JARS
Nick Chammas answered Nov 08 '22 23:11
