How to pre-package external libraries when using Spark on a Mesos cluster

According to the Spark on Mesos docs, one needs to set spark.executor.uri to point to a Spark distribution:

val conf = new SparkConf()
  .setMaster("mesos://HOST:5050")
  .setAppName("My app")
  .set("spark.executor.uri", "<path to spark-1.4.1.tar.gz uploaded above>")

The docs also note that one can build a custom version of the Spark distribution.

My question now is whether it is possible/desirable to pre-package external libraries such as

  • spark-streaming-kafka
  • elasticsearch-spark
  • spark-csv

which will be used in almost all of the job jars I'll submit via spark-submit, in order to

  • reduce the time sbt assembly needs to package the fat jars
  • reduce the size of the fat jars which need to be submitted

If so, how can this be achieved? Generally speaking, are there any hints on how the fat-jar generation and job submission process can be sped up?
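For reference, the kind of sbt build this is about might look roughly like the sketch below (sbt-assembly is assumed to be enabled in project/plugins.sbt; the project name and version numbers are placeholders, not taken from an actual build):

// build.sbt -- a minimal sketch of the current per-job fat-jar setup
name := "my-spark-job"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // Spark itself is already available on the executors, so it can be
  // marked "provided" and stays out of the fat jar
  "org.apache.spark"  %% "spark-core"            % "1.4.1" % "provided",
  // the external libraries mentioned above, which currently end up
  // inside every single fat jar produced by sbt assembly
  "org.apache.spark"  %% "spark-streaming-kafka" % "1.4.1",
  "org.elasticsearch" %% "elasticsearch-spark"   % "2.1.0",
  "com.databricks"    %% "spark-csv"             % "1.1.0"
)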

The background is that I want to run some code generation for Spark jobs, submit them right away, and show the results asynchronously in a browser frontend. The frontend part shouldn't be too complicated, but I wonder how the backend part can be achieved.

asked by Tobi on Aug 28 '15

2 Answers

Create a sample Maven project with all your dependencies and then use the maven-shade-plugin. It will create one shaded jar in your target folder.

Here is a sample pom:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com</groupId>
    <artifactId>test</artifactId>
    <version>0.0.1</version>
    <properties>
        <java.version>1.7</java.version>
        <hadoop.version>2.4.1</hadoop.version>
        <spark.version>1.4.0</spark.version>
        <version.spark-csv_2.10>1.1.0</version.spark-csv_2.10>
        <version.spark-avro_2.10>1.0.0</version.spark-avro_2.10>
    </properties>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <!-- <minimizeJar>true</minimizeJar> -->
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                                <exclude>org/bdbizviz/**</exclude>
                            </excludes>
                        </filter>
                    </filters>
                    <finalName>spark-${project.version}</finalName>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency> <!-- Hadoop dependency -->
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>servlet-api</artifactId>
                    <groupId>javax.servlet</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>guava</artifactId>
                    <groupId>com.google.guava</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>2.4</version>
        </dependency>

        <dependency> <!-- Spark Core -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency> <!-- Spark SQL -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency> <!-- Spark CSV -->
            <groupId>com.databricks</groupId>
            <artifactId>spark-csv_2.10</artifactId>
            <version>${version.spark-csv_2.10}</version>
        </dependency>
        <dependency> <!-- Spark Avro -->
            <groupId>com.databricks</groupId>
            <artifactId>spark-avro_2.10</artifactId>
            <version>${version.spark-avro_2.10}</version>
        </dependency>
        <dependency> <!-- Spark Hive -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency> <!-- Spark Hive thriftserver -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive-thriftserver_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
</project>
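Once built with mvn package, the shaded jar ends up in target/ (named spark-0.0.1.jar by the finalName above). As a hedged sketch of how it could then be wired into a job, so that each per-job fat jar only has to contain your own classes, one option is to distribute it via spark.jars; the HDFS location below is a placeholder, not something from this answer:

import org.apache.spark.SparkConf

// sketch only: the shaded dependency jar is assumed to have been uploaded
// to a location reachable by all Mesos agents (placeholder path below)
val conf = new SparkConf()
  .setMaster("mesos://HOST:5050")
  .setAppName("My app")
  .set("spark.executor.uri", "<path to spark-1.4.1.tar.gz uploaded above>")
  // spark.jars adds extra jars to the driver and executor classpaths,
  // so the job jar itself no longer needs to bundle these dependencies
  .set("spark.jars", "hdfs:///libs/spark-0.0.1.jar")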
answered by Kaushal

When you say pre-package, do you really mean distribute to all the slaves and set up the jobs to use those packages, so that you don't need to download them every time? That might be an option; however, it sounds a bit cumbersome, because distributing everything to the slaves and keeping all the packages up to date is not an easy task.

How about breaking your .tar.gz into smaller pieces, so that instead of a single fat file your jobs fetch several smaller ones? In that case it should be possible to leverage the Mesos Fetcher Cache. You'll see bad performance while the agent cache is cold, but once it warms up (i.e. once one job has run and downloaded the common files locally), subsequent jobs will complete faster.
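As a rough sketch of that idea in SparkConf terms (the URIs are hypothetical, and note that spark.mesos.uris and spark.mesos.fetcherCache.enable only appeared in Spark releases newer than the 1.4.1 discussed above, so check the Mesos documentation for your version):

import org.apache.spark.SparkConf

// hypothetical example: a slimmer base distribution plus separately
// fetched (and cacheable) archives for the common libraries
val conf = new SparkConf()
  .setMaster("mesos://HOST:5050")
  .setAppName("My app")
  .set("spark.executor.uri", "hdfs:///dist/spark-1.4.1-minimal.tar.gz")
  // extra files downloaded into each executor sandbox
  .set("spark.mesos.uris", "hdfs:///dist/kafka-libs.tar.gz,hdfs:///dist/es-libs.tar.gz")
  // let the Mesos fetcher cache keep these on the agent between jobs
  .set("spark.mesos.fetcherCache.enable", "true")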

answered by hartem