I am running the word count Java program in Spark. How do I run it from the command line?


How to run a Spark-java program from command line [closed]

1 Answer

Take the word count example, for instance from https://github.com/holdenk/fastdataprocessingwithsparkexamples/tree/master/src/main/scala/pandaspark/examples. Follow these steps to create the jar file:

mkdir example-java-build/; cd example-java-build

mvn archetype:generate \
   -DarchetypeGroupId=org.apache.maven.archetypes \
   -DgroupId=spark.examples \
   -DartifactId=JavaWordCount \
   -Dfilter=org.apache.maven.archetypes:maven-archetype-quickstart

cp ../examples/src/main/java/spark/examples/JavaWordCount.java \
   JavaWordCount/src/main/java/spark/examples/JavaWordCount.java

Add the relevant spark-core and spark-examples dependencies, making sure their versions match your version of Spark. I use Spark 1.1.0, so my pom.xml looks like this:

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-examples_2.10</artifactId>
      <version>1.1.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.1.0</version>
    </dependency>
  </dependencies>

Build your jar file using mvn.

cd example-java-build/JavaWordCount
mvn package

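This creates your jar file inside the target directory. (With the default quickstart build this is a plain jar that does not bundle its dependencies, which is enough here because spark-submit puts the Spark classes on the classpath.) Copy the jar file to any location on the server, then go to the bin folder of your Spark installation (in my case: /root/spark-1.1.0-bin-hadoop2.4/bin).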

Submit the Spark job. Mine looks like this:

./spark-submit --class "spark.examples.JavaWordCount" \
   --master yarn://myserver1:8032 \
   /root/JavaWordCount-1.0-SNAPSHOT.jar \
   hdfs://myserver1:8020/user/root/hackrfoe.txt

Here:

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)

The last argument is any text file of your choice for the program.
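To make the role of that last argument concrete: spark-submit calls the main method of the class named by --class and hands it everything that follows the jar path as args. The sketch below is a hypothetical entry point, not the actual JavaWordCount source; the class name EntryPointSketch and the helper inputPath are made up for illustration, and it only shows the argument handling, not any Spark calls.

```java
/** Hypothetical entry point showing how the trailing spark-submit argument reaches main. */
final class EntryPointSketch {

    /** Returns the input path (the first application argument) or throws if it is missing. */
    static String inputPath(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException("Usage: EntryPointSketch <file>");
        }
        return args[0];
    }

    public static void main(String[] args) {
        try {
            // With the command above, args[0] would be the hdfs:// path of the text file.
            System.out.println("Would read: " + inputPath(args));
        } catch (IllegalArgumentException e) {
            // No file given: report usage and exit with a non-zero status.
            System.err.println(e.getMessage());
            System.exit(1);
        }
    }
}
```

If the file argument is left off the spark-submit command line, a check like this is what turns the mistake into a usage message instead of an array index error.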

The output should look like this, giving the count of each word in the text file:

in: 17
sleeping.: 1
sojourns: 1
What: 4
protect: 1
largest: 1
other: 1
public: 1
worst: 1
hackers: 12
detected: 1
from: 4
and,: 1
secretly: 1
breaking: 1
football: 1
answer.: 1
attempting: 2
"hacker: 3
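Entries such as `sleeping.:`, `and,:` and `"hacker:` keep their punctuation because the words are split on spaces only. The following is a minimal plain-Java sketch of that split-and-count logic with no Spark dependency, so it can be run on its own. It assumes, as Spark's bundled JavaWordCount example does, that tokens are separated by a single space and printed as `word: count`; the class name WordCountSketch, the sample lines, and the choice to skip empty tokens are illustrative assumptions, not part of the original program.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Plain-Java sketch (no Spark) of the split-and-count logic; names are illustrative. */
final class WordCountSketch {

    // Assumption: words are separated by a single space; punctuation stays attached.
    private static final Pattern SPACE = Pattern.compile(" ");

    /** Counts every space-separated token across the given lines, in first-seen order. */
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : SPACE.split(line)) {
                if (word.isEmpty()) {
                    continue; // choice made in this sketch: ignore empty tokens from repeated spaces
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input; the real job reads the file passed on the command line.
        List<String> lines = Arrays.asList(
                "hackers and, hackers",
                "What \"hacker answer.");
        countWords(lines).forEach((word, count) -> System.out.println(word + ": " + count));
    }
}
```

Unlike this sketch, the Spark job distributes the counting across the cluster, so the order of the printed lines there is not guaranteed.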

Hope this helps!

160

answered Oct 13 '22 10:10

user1189851


Tags:

apache-spark

hadoop

hdfs

Pooja3101
