I am running the wordcount java program in spark. How do I run it from the command line.
Use spark://HOST:PORT for Standalone cluster, replace the host and port of stand-alone cluster. Use local to run locally with a one worker thread. Use local[k] and specify k with the number of cores you have locally, this runs application with k worker threads.
Type 'javac MyFirstJavaProgram. java' and press enter to compile your code. If there are no errors in your code, the command prompt will take you to the next line (Assumption: The path variable is set). Now, type ' java MyFirstJavaProgram ' to run your program.
Go to the Apache Spark Installation directory from the command line and type bin/spark-shell and press enter, this launches Spark shell and gives you a scala prompt to interact with Spark in scala language. If you have set the Spark in a PATH then just enter spark-shell in command line or terminal (mac users).
Pick up the wordcount example from say: https://github.com/holdenk/fastdataprocessingwithsparkexamples/tree/master/src/main/scala/pandaspark/examples. Follow these steps to create the fat jar file:
mkdir example-java-build/; cd example-java-build
mvn archetype:generate \
-DarchetypeGroupId=org.apache.maven.archetypes \
-DgroupId=spark.examples \
-DartifactId=JavaWordCount \
-Dfilter=org.apache.maven.archetypes:maven-archetype-quickstart
cp ../examples/src/main/java/spark/examples/JavaWordCount.java
JavaWordCount/src/main/java/spark/examples/JavaWordCount.java
You add the relevant spark-core and spark examples dependencies. Make sure you have the dependencies based on your version of spark. I use spark 1.1.0 and so I have the relevant dependencies. My pom.xml looks like this:
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-examples_2.10</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.1.0</version>
</dependency>
</dependencies>
Build your jar file using mvn.
cd example-java-build/JavaWordCount
mvn package
This creates your fat jar file inside the target directory.
Copy the jar file to any location on the server.
Go to the your bin folder of your spark. ( in my case: /root/spark-1.1.0-bin-hadoop2.4/bin
)
Submit spark job: My job looks like this:
./spark-submit --class "spark.examples.JavaWordCount" --master yarn://myserver1:8032 /root/JavaWordCount-1.0-SNAPSHOT.jar hdfs://myserver1:8020/user/root/hackrfoe.txt
Here --class is: The entry point for your application (e.g. org.apache.spark.examples.SparkPi) --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077) The last argument is any text file of your choice for the program.
The output should like this, giving word counts of all words in the text file.
in: 17
sleeping.: 1
sojourns: 1
What: 4
protect: 1
largest: 1
other: 1
public: 1
worst: 1
hackers: 12
detected: 1
from: 4
and,: 1
secretly: 1
breaking: 1
football: 1
answer.: 1
attempting: 2
"hacker: 3
Hope this helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With