How to mount S3 bucket on Kubernetes container/pods?

Tags: amazon-s3, kubernetes, apache-spark, s3fs, fuse

I am trying to run my Spark job on an Amazon EKS cluster. The job requires some static reference data on every worker/executor node, and this reference data is available in S3.

Can somebody help me find a clean and performant way to mount an S3 bucket on pods?

The S3 API is an option, and I already use it for my input records and output results. But the reference data is static, so I don't want to download it on every run of my Spark job. Ideally, the first run downloads the data, and later runs check whether it is already available locally and skip the download.

asked Aug 03 '18 12:08

Ajeet

2 Answers

We recently open-sourced a project that aims to automate these steps for you: https://github.com/IBM/dataset-lifecycle-framework

Basically you can create a dataset:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: "COS"
    accessKeyID: "iQkv3FABR0eywcEeyJAQ"
    secretAccessKey: "MIK3FPER+YQgb2ug26osxP/c8htr/05TVNJYuwmy"
    endpoint: "http://192.168.39.245:31772"
    bucket: "my-bucket-d4078283-dc35-4f12-a1a3-6f32571b0d62"
    region: "" #it can be empty

You will then get a PVC that you can mount in your pods.
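Since the question is about Spark executors, here is a minimal sketch of how such a PVC could be wired into a Spark-on-Kubernetes job through Spark's `spark.kubernetes.*.volumes.persistentVolumeClaim.*` settings. It assumes the PVC is named `example-dataset` (same as the Dataset above) and that its access mode lets several executor pods mount it at once; the volume name `refdata`, the mount path `/mnt/refdata`, and both helper functions are made up for illustration.

```python
def pvc_mount_conf(claim_name, mount_path, volume_name="refdata",
                   roles=("executor",), read_only=True):
    """Build Spark conf entries that mount an existing PVC into Spark pods.

    claim_name is assumed to be the PVC created for the Dataset;
    volume_name and mount_path are arbitrary, illustrative choices.
    """
    conf = {}
    for role in roles:  # "executor" and/or "driver"
        prefix = (f"spark.kubernetes.{role}.volumes."
                  f"persistentVolumeClaim.{volume_name}")
        conf[f"{prefix}.options.claimName"] = claim_name
        conf[f"{prefix}.mount.path"] = mount_path
        conf[f"{prefix}.mount.readOnly"] = "true" if read_only else "false"
    return conf


def to_submit_args(conf):
    """Turn a conf dict into a flat list of spark-submit '--conf k=v' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = pvc_mount_conf("example-dataset", "/mnt/refdata")
    print(" ".join(to_submit_args(conf)))
```

The printed arguments can be appended to a `spark-submit` command; the job would then read the reference data from `/mnt/refdata` as ordinary files.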

answered Sep 16 '22 12:09

Yiannis Gkoufas

In general, you just don't do that. Instead, interact with the S3 API directly to retrieve and store what you need (for example, via a tool like the AWS CLI).

Since you are running in AWS, you can configure IAM so that your nodes are authorized to access particular data at the "infrastructure" level, or you can provide S3 access tokens via Secrets, ConfigMaps, environment variables, etc.
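A minimal sketch of that approach, combined with the "download only once" behaviour the question asks for, might look like the following. It assumes `boto3` is installed and that credentials come from its default chain (an IAM role on the node or pod, or environment variables injected from a Secret). The function names and the cache layout are hypothetical, and the cache only survives between runs if `cache_dir` sits on storage that outlives the pod.

```python
import os
from pathlib import Path


def s3_download(bucket, key, dest):
    """Download one object with boto3, using its default credential chain."""
    import boto3  # third-party; imported lazily so the caching logic has no hard dependency

    boto3.client("s3").download_file(bucket, key, str(dest))


def ensure_local(bucket, key, cache_dir, download=s3_download):
    """Return a local path for s3://bucket/key, downloading only if it is missing."""
    dest = Path(cache_dir) / key
    if dest.exists():
        return dest  # an earlier run already fetched it
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a per-process temp file first, then rename, so that a
    # concurrent reader never sees a partially written file.
    tmp = dest.with_name(f"{dest.name}.{os.getpid()}.part")
    download(bucket, key, tmp)
    os.replace(tmp, dest)
    return dest
```

A job could call `ensure_local("my-bucket", "reference/lookup.csv", "/var/cache/refdata")` at startup (bucket, key, and directory are placeholders) and then read the returned path as a normal file.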

S3 is not a filesystem, so don't expect it to behave like one. Even though there are FUSE clients that emulate a filesystem on top of it, this is rarely the right solution.

answered Sep 20 '22 12:09

Radek 'Goblin' Pieczonka
