Dependency issue with Pyspark running on Kubernetes using spark-on-k8s-operator

I have spent days now trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector.

When I submit my Spark job without any dependency using sparkctl create sparkjob.yaml ... with the .yaml file below, it works like a charm.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-job
  namespace: my-namespace
spec:
  type: Python
  pythonVersion: "3"
  hadoopConf:
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    "fs.gs.project.id": "our-project-id"
    "fs.gs.system.bucket": "gcs-bucket-name"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/mnt/secrets/keyfile.json"
  mode: cluster
  image: "image-registry/spark-base-image"
  imagePullPolicy: Always
  mainApplicationFile: ./sparkjob.py
  deps:
    jars:
      - https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.4.5/spark-sql-kafka-0-10_2.11-2.4.5.jar
  sparkVersion: "2.4.5"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 2.4.5
    serviceAccount: spark-operator-spark
    secrets:
    - name: "keyfile"
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: our-project-id
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.5
    secrets:
    - name: "keyfile"
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: our-project-id

The Docker image spark-base-image is built from the following Dockerfile:

FROM gcr.io/spark-operator/spark-py:v2.4.5

RUN rm $SPARK_HOME/jars/guava-14.0.1.jar
ADD https://repo1.maven.org/maven2/com/google/guava/guava/28.0-jre/guava-28.0-jre.jar $SPARK_HOME/jars

ADD https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-2.0.1/gcs-connector-hadoop2-2.0.1-shaded.jar $SPARK_HOME/jars

ENTRYPOINT [ "/opt/entrypoint.sh" ]

The main application file is uploaded to GCS when submitting the application, then fetched from there and copied into the driver pod when the application starts. The problem starts whenever I want to supply my own Python module deps.zip as a dependency, so that I can use it in my main application file sparkjob.py.
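(For context on why shipping a .zip like this should work at all: Spark's pyFiles mechanism ultimately just places each archive on sys.path on the driver and executors, and Python's zip-import machinery resolves modules from it. A minimal stdlib-only sketch of that mechanism, with a hypothetical module name:)

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny deps.zip containing one module, mimicking the packaged dependency.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mymodule.py", "def greet():\n    return 'hello from deps.zip'\n")

# This is effectively what Spark does on the driver and each executor
# for every entry in pyFiles: prepend the archive to the Python path.
sys.path.insert(0, zip_path)

import mymodule
print(mymodule.greet())  # hello from deps.zip
```

So as long as the archive actually reaches the pods, importing from it in sparkjob.py is straightforward; the fetching is the hard part here.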

Here's what I have tried so far:

1.

Added the following lines to spark.deps in sparkjob.yaml

pyFiles:
   - ./deps.zip

which resulted in the operator not even being able to submit the Spark application, failing with the error

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

./deps.zip is uploaded to the GCS bucket successfully, along with the main application file. But while the main application file can be fetched from GCS without problems (I can see this in the logs of jobs without dependencies, as defined above), ./deps.zip somehow cannot be fetched from there. I also tried adding the gcs-connector jar to the spark.deps.jars list explicitly, but nothing changed.
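For reference, the deps section of this attempt with the connector jar listed explicitly looked roughly like this (the jar coordinates are an assumption, matching the connector version baked into the Dockerfile above):

```yaml
deps:
  pyFiles:
    - ./deps.zip
  jars:
    - https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.4.5/spark-sql-kafka-0-10_2.11-2.4.5.jar
    # Explicitly listing the GCS connector, which is already in the image,
    # did not change the ClassNotFoundException at submission time.
    - https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-2.0.1/gcs-connector-hadoop2-2.0.1-shaded.jar
```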

2.

I added ./deps.zip to the base Docker image used for starting the driver and executor pods by adding COPY ./deps.zip /mnt/ to the above Dockerfile, and declared the dependency in sparkjob.yaml via

pyFiles:
    - local:///mnt/deps.zip

This time the Spark job can be submitted and the driver pod starts, but I get a file:/mnt/deps.zip not found error when the Spark context is being initialized. I also tried additionally setting ENV SPARK_EXTRA_CLASSPATH=/mnt/ in the Dockerfile, without any success. I even tried to explicitly mount the whole /mnt/ directory into the driver and executor pods using volume mounts, but that didn't work either.
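For completeness, the modified Dockerfile from this attempt would look roughly like this (a sketch reconstructing the changes described above):

```dockerfile
FROM gcr.io/spark-operator/spark-py:v2.4.5

RUN rm $SPARK_HOME/jars/guava-14.0.1.jar
ADD https://repo1.maven.org/maven2/com/google/guava/guava/28.0-jre/guava-28.0-jre.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-2.0.1/gcs-connector-hadoop2-2.0.1-shaded.jar $SPARK_HOME/jars

# Bake the Python dependency into the image and expose its directory
# on the extra classpath.
COPY ./deps.zip /mnt/
ENV SPARK_EXTRA_CLASSPATH=/mnt/

ENTRYPOINT [ "/opt/entrypoint.sh" ]
```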


Edit:

My workaround (2), adding the dependencies to the Docker image and setting ENV SPARK_EXTRA_CLASSPATH=/mnt/ in the Dockerfile, actually worked! It turns out the tag didn't update and I had been using an old version of the Docker image all along. Duh.
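As a guard against this kind of stale-image problem, one option (a suggestion, not part of the original setup) is to reference the image by its digest in sparkjob.yaml, since a digest uniquely identifies one image build and cannot silently point at an older cached version:

```yaml
spec:
  # A digest-pinned reference; obtain the actual digest from your registry
  # (e.g. after pushing the image). The value below is a placeholder.
  image: "image-registry/spark-base-image@sha256:<digest-from-your-registry>"
  imagePullPolicy: Always
```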

I still don't know why the (more elegant) solution 1 via the gcs-connector isn't working, but it might be related to MountVolume.SetUp failed for volume "spark-conf-volume"

asked Jun 18 '20 by denise



1 Answer

Facing a similar issue, where the zip contained jars that my Spark job always requires, I just added

FROM gcr.io/spark-operator/spark-py:v2.4.5
COPY mydepjars/ /opt/spark/jars/

and everything gets loaded into my Spark session. This could be one way to do it.

answered Nov 15 '22 by Devesh