I have 10 JPEG images in a directory. I want to read all of them in parallel using PySpark. I tried the following:
import glob

import numpy as np
from PIL import Image
from pyspark import SparkContext, SparkConf

conf = SparkConf()
spark = SparkContext(conf=conf)

files = glob.glob("E:\\tests\\*.jpg")
files_ = spark.parallelize(files)

arrs = []
for fi in files_.toLocalIterator():
    im = Image.open(fi)
    data = np.asarray(im)
    arrs.append(data)

img = np.array(arrs)
print(img.shape)
The code ran without error and printed out img.shape; however, it did not run in parallel. Could you help me?
To create a SparkContext you first need to build a SparkConf object that contains information about your application. Only one SparkContext may be active per JVM; you must stop() the active SparkContext before creating a new one. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster. When you create a new SparkContext, at least the master and the app name should be set, either through the named parameters or through conf.
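For example, a minimal local setup could look like this (the master URL local[*] and the app name are illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("read-images")
spark = SparkContext(conf=conf)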
There are several ways of achieving parallelization in PySpark. Native Spark: if you're using Spark data frames and libraries (e.g. MLlib), then your code will be parallelized and distributed natively by Spark. The RDD is the basic data structure used in PySpark: an immutable collection of objects that is the starting point for a Spark application. The data is computed on different nodes of the Spark cluster, which is what makes the parallel processing happen. For plain Python code, it is also possible to use thread pools or pandas UDFs to parallelize in a Spark environment; just be careful about how you parallelize your tasks, and try to also distribute workloads where possible (see the sketch after this paragraph). Note that your loop did not run in parallel because rdd.toLocalIterator() pulls the elements back to the driver one at a time, so Image.open and np.asarray ran sequentially in the driver process rather than on the executors.
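A minimal driver-side sketch of the thread-pool option (this loads the images concurrently on one machine rather than distributing them across a cluster; the pool size is arbitrary):

import glob
from multiprocessing.pool import ThreadPool

import numpy as np
from PIL import Image

def load_image(path):
    # Each worker thread decodes one image
    return np.asarray(Image.open(path))

with ThreadPool(4) as pool:
    arrs = pool.map(load_image, glob.glob("E:\\tests\\*.jpg"))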
In Spark 2.3 or later you can use built-in Spark tools to load image data into a Spark DataFrame; the image content is loaded into the image.data column. At the moment this functionality is experimental and lacks the required ecosystem, but it should improve in the future.
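A minimal sketch of that image source (assuming Spark 2.4+, where it is exposed as a data source format, and an active SparkSession named spark):

image_df = spark.read.format("image").load("E:/tests")
image_df.printSchema()
# The image column is a struct with fields:
# origin, height, width, nChannels, mode, data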
You can use rdd.map to load and transform the pictures in parallel and then collect the RDD into a Python list:
files = glob.glob("E:\\tests\\*.jpg")
file_rdd = spark.parallelize(files)

def image_to_array(path):
    # Runs on the executors: load one image and convert it to an array
    im = Image.open(path)
    data = np.asarray(im)
    return data

array_rdd = file_rdd.map(image_to_array)
result_list = array_rdd.collect()
result_list is now a list with 10 elements; each element is a numpy.ndarray.
The function image_to_array will be executed on different Spark executors in parallel. If you have a multi-node Spark cluster, you have to make sure that all nodes can access E:\\tests\\.
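If a shared filesystem is not available, one alternative sketch is to let Spark ship the raw bytes itself via binaryFiles (assuming spark is the SparkContext from the question):

import io

binary_rdd = spark.binaryFiles("E:\\tests\\*.jpg")  # RDD of (path, bytes) pairs
array_rdd = binary_rdd.map(lambda kv: np.asarray(Image.open(io.BytesIO(kv[1]))))
result_list = array_rdd.collect()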
After collecting the arrays, processing can continue with
img = np.array(result_list, dtype=object)
The dtype=object allows the collected images to have different dimensions.
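If all 10 images happen to have the same dimensions, they can instead be stacked into a single numeric array:

img = np.stack(result_list)  # shape: (10, height, width, channels)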
My solution follows the same idea as werner's, but uses only Spark libraries:
from pyspark.ml.image import ImageSchema
import numpy as np

# Note: spark must be a SparkSession here, not the SparkContext from the question.
df = (spark
      .read
      .format("image")
      .option("pathGlobFilter", "*.jpg")
      .load("your_data_path"))
df = df.select('image.*')

# Pre-caching the required schema. If you remove this line an error will be raised.
ImageSchema.imageFields

# Transforming images to np.array
arrays = df.rdd.map(ImageSchema.toNDArray).collect()

img = np.array(arrays)
print(img.shape)
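If ImageSchema.toNDArray is not available in your version, a hedged alternative is to decode the image struct manually (this sketch assumes the standard image schema fields and the OpenCV-style BGR channel order used by the image data source; row_to_array is an illustrative helper name):

def row_to_array(row):
    # After select('image.*'), each row has origin, height, width, nChannels, mode, data
    return np.frombuffer(row.data, dtype=np.uint8).reshape(row.height, row.width, row.nChannels)

arrays = df.rdd.map(row_to_array).collect()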