
Specifying an external configuration file for Apache Spark

I'd like to specify all of Spark's properties in a configuration file, and then load that configuration file at runtime.

~~~~~~~~~~Edit~~~~~~~~~~~

It turns out I was pretty confused about how to go about doing this. Ignore the rest of this question. For a simple solution (in Java Spark) that loads a .properties file into a Spark cluster, see my answer below.

The original question is kept below for reference purposes only.

~~~~~~~~~~~~~~~~~~~~~~~~

I want:

  • Different configuration files depending on the environment (local, AWS)
  • The ability to specify application-specific parameters

As a simple example, imagine I'd like to filter lines in a log file depending on a string. Below is a Java Spark program that reads data from a file and filters it based on a string the user defines. The program takes one argument, the input source file.

Java Spark Code

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SimpleSpark {
    public static void main(String[] args) {
        String inputFile = args[0]; // Should be some file on your system

        SparkConf conf = new SparkConf();// .setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(inputFile).cache();

        final String filterString = conf.get("simplespark.filterstr"); // key defined in the config file

        long numberLines = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains(filterString);
            }
        }).count();

        System.out.println("Line count: " + numberLines);
    }
}

Config File

The configuration file is based on https://spark.apache.org/docs/1.3.0/configuration.html and looks like this:

spark.app.name          test_app
spark.executor.memory   2g
spark.master            local
simplespark.filterstr   a

The Problem

I execute the application using the following arguments:

/path/to/inputtext.txt --conf /path/to/configfile.config
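
For reference, the full spark-submit invocation looks roughly like this (the class and jar names here are placeholders, not the actual ones I use):

spark-submit --class SimpleSpark --conf /path/to/configfile.config simple-spark.jar /path/to/inputtext.txt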

However, this doesn't work; the exception

Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration

is thrown, which to me means the configuration file is not being loaded.

My questions are:

  1. What is wrong with my setup?
  2. Is specifying application-specific parameters in the Spark configuration file good practice?
asked Apr 04 '15 by Alexander


2 Answers

Try passing the file to spark-submit with the --properties-file flag:

--properties-file /path/to/configfile.config

Then access the values in your Scala program as:

sc.getConf.get("spark.app.name")
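
For example, a complete spark-submit invocation might look like this (the application class and jar names below are placeholders):

spark-submit \
  --class SimpleSpark \
  --properties-file /path/to/configfile.config \
  simple-spark.jar \
  /path/to/inputtext.txt
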
answered Oct 07 '22 by Poojaa Karaande


So after a bit of time, I realized I was pretty confused. The easiest way to get a configuration file into memory is to use a standard properties file, put it in HDFS, and load it from there. For the record, here is the code to do it (in Java Spark):

import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf();
JavaSparkContext ctx = new JavaSparkContext(sparkConf);

// Open the properties file stored in HDFS, using the context's Hadoop configuration
Path pt = new Path("hdfs:///user/hadoop/myproperties.properties");
FileSystem fs = FileSystem.get(ctx.hadoopConfiguration());
InputStream inputStream = fs.open(pt);

// Load the key/value pairs into a standard java.util.Properties object
Properties properties = new Properties();
properties.load(inputStream);
inputStream.close();
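
The loaded properties can then be read like any other java.util.Properties object. For example, to pull out the application-specific filter string from the original question (using the key name from my config file):

// Read an application-specific value, with a fallback if the key is missing
String filterString = properties.getProperty("simplespark.filterstr", "a");
System.out.println("Filtering on: " + filterString);
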
answered Oct 07 '22 by Alexander