Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Spark: Streaming without HDFS checkpoint

I'm implementing a Spark job which makes use of reduceByKeyAndWindow, therefore I need to add checkpointing.

From Spark's website I see that:

Checkpointing can be enabled by setting a directory in a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.) to which the checkpoint information will be saved.

My application is just for academic purposes, thus I don't want to set an HDFS for checkpointing but just a local file. Doing so in MacOS works fine (setting a temporary dir as checkpoint dir), the problem comes when doing it in Windows, which throws an exception for permissions.

I already tried starting eclipse as administrator and creating the directory manually setting setWritable, setReadable and setExecutable to true. Any hint on how to overcome the problem in Windows?

Thanks!

Update Here's my code and exception. Just to clarify again, it works fine in Mac but not in Windows.

SparkConf conf = new SparkConf().setAppName("testApp").setMaster("local[2]");
JavaSparkContext ctx = new JavaSparkContext(conf);
JavaStreamingContext jsc = new JavaStreamingContext(ctx, new Duration(1000));
jsc.checkpoint(Files.createTempDir().getAbsolutePath());

Exception:

Exception in thread "pool-7-thread-3" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:772)
at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:135)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
like image 659
sergi123 Avatar asked Mar 30 '26 20:03

sergi123


2 Answers

Solved by adding the latest Hadoop libraries to my project.

If using Maven, the following set of dependencies do the trick.

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.3.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.10</artifactId>
        <version>1.2.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-twitter_2.11</artifactId>
        <version>1.3.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0</version>
    </dependency>

like image 98
sergi123 Avatar answered Apr 02 '26 15:04

sergi123


On Windows, you can solve this problem as following

  • Download winutils.exe to a folder say MY_UTILS/bin
  • Create an environment variable HADOOP_HOME and point it to MY_UTILS
like image 37
smishra Avatar answered Apr 02 '26 15:04

smishra