
Using PySpark, read/write 2D images on the Hadoop file system

I want to be able to read/write images on an HDFS file system and take advantage of HDFS data locality.

I have a collection of images where each image is composed of

  • a 2D array of uint16 values
  • basic additional information stored as an XML file.

I want to create an archive on the HDFS file system and use Spark to analyze it. Right now I am struggling with the best way to store the data on HDFS in order to take full advantage of the Spark+HDFS structure.

From what I understand, the best way would be to create a SequenceFile wrapper. I have two questions:

  • Is creating a SequenceFile wrapper the best way?
  • Does anybody have a pointer to examples I could start from? I can't be the first one who needs to read something other than text files from HDFS through Spark! (A rough sketch of the SequenceFile idea I have in mind follows this list.)
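
To make the question concrete, here is a rough, untested sketch of the kind of SequenceFile archive I have in mind: storing (file name, raw bytes) pairs with saveAsSequenceFile. The local and HDFS paths are just placeholders, and I have not verified the Writable conversions:

import glob

def to_pair(path):
    # key = file name, value = raw image bytes; the XML sidecar could be
    # stored the same way (or packed together with the image)
    with open(path, 'rb') as f:
        return (path, bytearray(f.read()))

# placeholder local path, assumed visible from the executors
paths = glob.glob('/local/images/*.tif')
pairs = sc.parallelize(paths).map(to_pair)

# placeholder HDFS destination
pairs.saveAsSequenceFile('hdfs://localhost:9000/image_archive')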
asked Feb 25 '15 by MathiasOrtner



1 Answer

I have found a solution that works: using PySpark 1.2.0's binaryFiles does the job. It is flagged as experimental, but I was able to read TIFF images by combining it with OpenCV.

import cv2
import numpy as np

# build the RDD and take one element for testing purposes
L = sc.binaryFiles('hdfs://localhost:9000/*.tif').take(1)

# convert the raw file content to a bytearray and then to a numpy uint8 array
file_bytes = np.asarray(bytearray(L[0][1]), dtype=np.uint8)

# use OpenCV to decode the byte array (flag 1 = load as a color image)
R = cv2.imdecode(file_bytes, 1)

Note the pyspark help for binaryFiles:

binaryFiles(path, minPartitions=None)

    :: Experimental

    Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

    Note: Small files are preferred, large file is also allowable, but may cause bad performance.
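
For what it's worth, here is a rough sketch (not part of the original answer) of how the same decoding step can be applied lazily across the whole RDD instead of a single element taken for testing. cv2.IMREAD_UNCHANGED is used here on the assumption that you want to keep the 16-bit depth of the TIFFs rather than the 8-bit color conversion that flag 1 performs:

import cv2
import numpy as np

def decode(raw_bytes):
    # reinterpret the raw file content as a uint8 buffer for OpenCV
    buf = np.asarray(bytearray(raw_bytes), dtype=np.uint8)
    # IMREAD_UNCHANGED keeps the original bit depth, so 16-bit TIFFs come
    # back as uint16 arrays instead of 8-bit color images
    return cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)

# mapValues keeps the file path as the key and decodes each value lazily
images = sc.binaryFiles('hdfs://localhost:9000/*.tif').mapValues(decode)
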
answered Oct 13 '22 by MathiasOrtner