I have spent days now trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector.
When I submit my Spark job without a dependency using sparkctl create sparkjob.yaml with the .yaml file below, it works like a charm.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-job
  namespace: my-namespace
spec:
  type: Python
  pythonVersion: "3"
  hadoopConf:
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    "fs.gs.project.id": "our-project-id"
    "fs.gs.system.bucket": "gcs-bucket-name"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/mnt/secrets/keyfile.json"
  mode: cluster
  image: "image-registry/spark-base-image"
  imagePullPolicy: Always
  mainApplicationFile: ./sparkjob.py
  deps:
    jars:
      - https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.4.5/spark-sql-kafka-0-10_2.11-2.4.5.jar
  sparkVersion: "2.4.5"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 2.4.5
    serviceAccount: spark-operator-spark
    secrets:
      - name: "keyfile"
        path: "/mnt/secrets"
        secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: our-project-id
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.5
    secrets:
      - name: "keyfile"
        path: "/mnt/secrets"
        secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: our-project-id
The Docker image spark-base-image is built with the following Dockerfile:
FROM gcr.io/spark-operator/spark-py:v2.4.5
RUN rm $SPARK_HOME/jars/guava-14.0.1.jar
ADD https://repo1.maven.org/maven2/com/google/guava/guava/28.0-jre/guava-28.0-jre.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-2.0.1/gcs-connector-hadoop2-2.0.1-shaded.jar $SPARK_HOME/jars
ENTRYPOINT [ "/opt/entrypoint.sh" ]
The main application file is uploaded to GCS when the application is submitted and is subsequently fetched from there and copied into the driver pod when the application starts. The problem starts whenever I want to supply my own Python module deps.zip as a dependency so that I can use it in my main application file sparkjob.py.
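For context, here is a minimal sketch of what sparkjob.py is meant to look like once deps.zip ends up on the driver's and executors' PYTHONPATH; the module name my_deps and the gs:// paths are placeholders, not taken from the real job:
from pyspark.sql import SparkSession

# Hypothetical module packaged inside deps.zip; Spark adds zip archives passed
# via spark.deps.pyFiles / --py-files to the PYTHONPATH of the driver and executors.
from my_deps import transforms

spark = SparkSession.builder.appName("spark-job").getOrCreate()

# Placeholder I/O on the GCS bucket configured in hadoopConf above.
df = spark.read.json("gs://gcs-bucket-name/input/")
transforms.run(df).write.parquet("gs://gcs-bucket-name/output/")

spark.stop()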
Here's what I have tried so far:
1. Added the following lines to spark.deps in sparkjob.yaml:
pyFiles:
  - ./deps.zip
which resulted in the operator not even being able to submit the Spark application, failing with
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
./deps.zip is successfully uploaded to the GCS bucket along with the main application file, but while the main application file can be fetched from GCS successfully (I can see this in the logs of jobs without dependencies, defined as above), ./deps.zip somehow cannot be fetched from there. I also tried adding the gcs-connector jar to the spark.deps.jars list explicitly - nothing changed.
2. I added ./deps.zip to the base Docker image used for starting up the driver and executor pods by adding
COPY ./deps.zip /mnt/
to the above Dockerfile, and declared the dependency in sparkjob.yaml via
pyFiles:
  - local:///mnt/deps.zip
This time the Spark job can be submitted and the driver pod starts, but I get a file:/mnt/deps.zip not found error when the Spark context is being initialized.
I also tried additionally setting ENV SPARK_EXTRA_CLASSPATH=/mnt/ in the Dockerfile, but without any success. I even tried to explicitly mount the whole /mnt/ directory into the driver and executor pods using volume mounts, but that didn't work either.
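For what it's worth, here is a sketch of a runtime check that could be run from sparkjob.py to verify whether the zip actually exists inside the driver container; SparkContext.addPyFile is standard PySpark, and the /mnt/deps.zip path simply mirrors where the Dockerfile above copies the archive:
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-job").getOrCreate()

# Is the zip baked into the image actually visible from the driver pod?
print("deps.zip present on driver:", os.path.exists("/mnt/deps.zip"))

# Fallback: register the archive at runtime instead of via spark.deps.pyFiles;
# addPyFile ships it to the executors and puts it on the PYTHONPATH.
if os.path.exists("/mnt/deps.zip"):
    spark.sparkContext.addPyFile("/mnt/deps.zip")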
Edit: my workaround (2), adding the dependency to the Docker image and setting ENV SPARK_EXTRA_CLASSPATH=/mnt/ in the Dockerfile, actually worked! It turns out the image tag didn't update and I had been using an old version of the Docker image all along. Duh.
I still don't know why the (more elegant) solution 1 via the gcs-connector isn't working, but it might be related to MountVolume.Setup failed for volume "spark-conf-volume"
If the zip file contains jars that you always need when running your Spark job: I faced a similar issue and just added
FROM gcr.io/spark-operator/spark-py:v2.4.5
COPY mydepjars/ /opt/spark/jars/
and everything gets loaded within my Spark session. This could be one way to do it.