
Using PySpark, read/write 2D images on the Hadoop file system

I want to be able to read/write images on an HDFS file system and take advantage of HDFS data locality.

I have a collection of images where each image is composed of

  • a 2D array of uint16 values
  • basic additional information stored as an XML file.

I want to create an archive on the HDFS file system and use Spark to analyze it. Right now I am struggling with the best way to store the data on HDFS in order to take full advantage of the Spark+HDFS structure.

From what I understand, the best way would be to create a SequenceFile wrapper. I have two questions:

  • Is creating a SequenceFile wrapper the best way?
  • Does anybody have a pointer to examples I could start from? I can't be the first one who needs to read something other than text files from HDFS through Spark! (A rough sketch of the SequenceFile idea I have in mind follows this list.)
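
To make the question concrete, here is a rough, untested sketch of the kind of SequenceFile archive I have in mind: storing (file name, raw bytes) pairs with saveAsSequenceFile. The local and HDFS paths are just placeholders, and I have not verified the Writable conversions:

import glob

def to_pair(path):
    # key = file name, value = raw image bytes; the XML sidecar could be
    # stored the same way (or packed together with the image)
    with open(path, 'rb') as f:
        return (path, bytearray(f.read()))

# placeholder local path, assumed visible from the executors
paths = glob.glob('/local/images/*.tif')
pairs = sc.parallelize(paths).map(to_pair)

# placeholder HDFS destination
pairs.saveAsSequenceFile('hdfs://localhost:9000/image_archive')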
asked Feb 25 '15 by MathiasOrtner



1 Answer

I have found a solution that works: using PySpark 1.2.0's binaryFiles does the job. It is flagged as experimental, but I was able to read TIFF images by combining it with OpenCV.

import cv2
import numpy as np

# build the RDD and take one element for testing purposes
L = sc.binaryFiles('hdfs://localhost:9000/*.tif').take(1)

# convert the raw file content to a bytearray and then to a numpy uint8 array
file_bytes = np.asarray(bytearray(L[0][1]), dtype=np.uint8)

# use OpenCV to decode the byte array (flag 1 = load as a color image)
R = cv2.imdecode(file_bytes, 1)

Note the pyspark help for binaryFiles:

binaryFiles(path, minPartitions=None)

    :: Experimental

    Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

    Note: Small files are preferred, large file is also allowable, but may cause bad performance.
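
For what it's worth, here is a rough sketch (not part of the original answer) of how the same decoding step can be applied lazily across the whole RDD instead of a single element taken for testing. cv2.IMREAD_UNCHANGED is used here on the assumption that you want to keep the 16-bit depth of the TIFFs rather than the 8-bit color conversion that flag 1 performs:

import cv2
import numpy as np

def decode(raw_bytes):
    # reinterpret the raw file content as a uint8 buffer for OpenCV
    buf = np.asarray(bytearray(raw_bytes), dtype=np.uint8)
    # IMREAD_UNCHANGED keeps the original bit depth, so 16-bit TIFFs come
    # back as uint16 arrays instead of 8-bit color images
    return cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)

# mapValues keeps the file path as the key and decodes each value lazily
images = sc.binaryFiles('hdfs://localhost:9000/*.tif').mapValues(decode)
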
answered Oct 13 '22 by MathiasOrtner