Hi there I have a lot of images (lower millions) that I need to do classification on. I am using Spark and managed to read in all the images in the format of <code>(filename1, content1), (filename2, content2) ...</code> into a big RDD. <pre class="prettyprint"><code>images = sc.wholeTextFiles("hdfs:///user/myuser/images/image/00*") </code></pre> However, I got really confused what to do with the unicode representation of the image. Here is an example of one image/file: <pre class="prettyprint"><code>(u'hdfs://NameService/user/myuser/images/image/00product.jpg', u'\ufffd\ufffd\ufffd\ufffd\x00\x10JFIF\x00\x01\x01\x01\x00`\x00`\x00\x00\ufffd\ufffd\x01\x1eExif\x00\x00II*\x00\x08\x00\x00\x00\x08\x00\x12\x01\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00\x1a\x01\x05\x00\x01\x00\x00\x00n\x00\x00\x00\x1b\x01\x05\x00\x01\x00\x00\x00v\x00\x00\x00(\x01\x03\x00\x01\x00\x00\x00\x02\x00\x00\x001\x01\x02\x00\x0b\x00\x00\x00~\x00\x00\x002\x01\x02\x00\x14\x00\x00\x00\ufffd\x00\x00\x00\x13\x02\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00i\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00GIMP 2.8.2\x00\x002013:07:29 10:41:35\x00\x07\x00\x00\ufffd\x07\x00\x04\x00\x00\x000220\ufffd\ufffd\x02\x00\x04\x00\x00\x00407\x00\x00\ufffd\x07\x00\x04\x00\x00\x000100\x01\ufffd\x03\x00\x01\x00\x00\x00\ufffd\ufffd\x00\x00\x02\ufffd\x04\x00\x01\x00\x00\x00\x04\x04\x00\x00\x03\ufffd\x04\x00\x01\x00\x00\x00X\x01\x00\x00\x05\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00\x02\x00\x01\x00\x02\x00\x04\x00\x00\x00R98\x00\x02\x00\x07\x00\x04\x00\x00\x000100\x00\x00\x00\x00\ufffd\ufffd\x04_http://ns.adobe.com/xap/1.0/\x00<?xpacket begin=\'\ufeff\' id=\'W5M0MpCehiHzreSzNTczkc9d\'?>\n<x:xmpmeta xmlns:x=\'adobe:ns:meta/\'>\n<rdf:RDF xmlns:rdf=\'http://www.w3.org/1999/02/22-rdf-syntax-ns#\'>\n\n <rdf:Description xmlns:exif=\'http://ns.adobe.com/exif/1.0/\'>\n <exif:Orientation>Top-left</exif:Orientation>\n <exif:XResolution>96</exif:XResolution>\n <exif:YResolution>96</exif:YResolution>\n <exif:ResolutionUnit>Inch</exif:ResolutionUnit>\n <exif:Software>ACD Systems Digital Imaging</exif:Software>\n <exif:DateTime>2013:07:29 10:37:00</exif:DateTime>\n <exif:YCbCrPositioning>Centered</exif:YCbCrPositioning>\n <exif:ExifVersion>Exif Version 2.2</exif:ExifVersion>\n <exif:SubsecTime>407</exif:SubsecTime>\n <exif:FlashPixVersion>FlashPix Version 1.0</exif:FlashPixVersion>\n <exif:ColorSpace>Uncalibrated</exif:ColorSpace>\n </code></pre> Looking closer, there are actually some characters look like the metadata like <pre class="prettyprint"><code>... <x:xmpmeta xmlns:x=\'adobe:ns:meta/\'>\n<rdf:RDF xmlns:rdf=\'http://www.w3.org/1999/02/22-rdf-syntax-ns#\'>\n\n <rdf:Description xmlns:exif=\'http://ns.adobe.com/exif/1.0/\'>\n <exif:Orientation>Top-left</exif:Orientation>\n <exif:XResolution>96</exif:XResolution>\n <exif:YResolution>96</exif:YResolution>\n ... </code></pre> My previous experience was using the package scipy and related functions like 'imread' ... and the input is usually a filename. Now I really got lost what does those unicode mean and what I can do to transform it into a format that I am familiar with. Can anyone share with me how can I read in those unicode into a scipy image (ndarray)?

Your data looks like the raw bytes from a real image file (JPG?). The problem with your data is that it should be bytes, not unicode. You have to figure out how to convert from unicode to bytes. There is a whole can of worms full of encoding traps you have to deal with, but you may be lucky using <code>img.encode('iso-8859-1')</code>. I don't know and I will not deal with that in my answer. The raw data for a PNG image looks like this: <pre class="prettyprint"><code>rawdata = '\x89PNG\r\n\x1a\n\x00\x00...\x00\x00IEND\xaeB`\x82' </code></pre> Once you have it in bytes, you can create a PIL image from the raw data, and read it as a nparray: <pre class="prettyprint"><code>>>> from StringIO import StringIO >>> from PIL import Image >>> import numpy as np >>> np.asarray(Image.open(StringIO(rawdata))) array([[[255, 255, 255, 0], [255, 255, 255, 0], [255, 255, 255, 0], ..., [255, 255, 255, 0], [255, 255, 255, 0], [255, 255, 255, 0]]], dtype=uint8) </code></pre> All you need to make it work on Spark is <code>SparkContext.binaryFiles</code>: <pre class="prettyprint"><code>>>> images = sc.binaryFiles("path/to/images/") >>> image_to_array = lambda rawdata: np.asarray(Image.open(StringIO(rawdata))) >>> images.values().map(image_to_array) </code></pre>

Spark using PySpark read images

Tags:

python

image

scipy

apache-spark

pyspark

Hi there I have a lot of images (lower millions) that I need to do classification on. I am using Spark and managed to read in all the images in the format of (filename1, content1), (filename2, content2) ... into a big RDD.

images = sc.wholeTextFiles("hdfs:///user/myuser/images/image/00*")

However, I got really confused what to do with the unicode representation of the image.

Here is an example of one image/file:

(u'hdfs://NameService/user/myuser/images/image/00product.jpg', u'\ufffd\ufffd\ufffd\ufffd\x00\x10JFIF\x00\x01\x01\x01\x00`\x00`\x00\x00\ufffd\ufffd\x01\x1eExif\x00\x00II*\x00\x08\x00\x00\x00\x08\x00\x12\x01\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00\x1a\x01\x05\x00\x01\x00\x00\x00n\x00\x00\x00\x1b\x01\x05\x00\x01\x00\x00\x00v\x00\x00\x00(\x01\x03\x00\x01\x00\x00\x00\x02\x00\x00\x001\x01\x02\x00\x0b\x00\x00\x00~\x00\x00\x002\x01\x02\x00\x14\x00\x00\x00\ufffd\x00\x00\x00\x13\x02\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00i\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00GIMP 2.8.2\x00\x002013:07:29 10:41:35\x00\x07\x00\x00\ufffd\x07\x00\x04\x00\x00\x000220\ufffd\ufffd\x02\x00\x04\x00\x00\x00407\x00\x00\ufffd\x07\x00\x04\x00\x00\x000100\x01\ufffd\x03\x00\x01\x00\x00\x00\ufffd\ufffd\x00\x00\x02\ufffd\x04\x00\x01\x00\x00\x00\x04\x04\x00\x00\x03\ufffd\x04\x00\x01\x00\x00\x00X\x01\x00\x00\x05\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00\x02\x00\x01\x00\x02\x00\x04\x00\x00\x00R98\x00\x02\x00\x07\x00\x04\x00\x00\x000100\x00\x00\x00\x00\ufffd\ufffd\x04_http://ns.adobe.com/xap/1.0/\x00<?xpacket begin=\'\ufeff\' id=\'W5M0MpCehiHzreSzNTczkc9d\'?>\n<x:xmpmeta xmlns:x=\'adobe:ns:meta/\'>\n<rdf:RDF xmlns:rdf=\'http://www.w3.org/1999/02/22-rdf-syntax-ns#\'>\n\n <rdf:Description xmlns:exif=\'http://ns.adobe.com/exif/1.0/\'>\n  <exif:Orientation>Top-left</exif:Orientation>\n  <exif:XResolution>96</exif:XResolution>\n  <exif:YResolution>96</exif:YResolution>\n  <exif:ResolutionUnit>Inch</exif:ResolutionUnit>\n  <exif:Software>ACD Systems Digital Imaging</exif:Software>\n  <exif:DateTime>2013:07:29 10:37:00</exif:DateTime>\n  <exif:YCbCrPositioning>Centered</exif:YCbCrPositioning>\n  <exif:ExifVersion>Exif Version 2.2</exif:ExifVersion>\n  <exif:SubsecTime>407</exif:SubsecTime>\n  <exif:FlashPixVersion>FlashPix Version 1.0</exif:FlashPixVersion>\n  <exif:ColorSpace>Uncalibrated</exif:ColorSpace>\n

Looking closer, there are actually some characters look like the metadata like

...
<x:xmpmeta xmlns:x=\'adobe:ns:meta/\'>\n<rdf:RDF xmlns:rdf=\'http://www.w3.org/1999/02/22-rdf-syntax-ns#\'>\n\n 
<rdf:Description xmlns:exif=\'http://ns.adobe.com/exif/1.0/\'>\n  
<exif:Orientation>Top-left</exif:Orientation>\n  
<exif:XResolution>96</exif:XResolution>\n  
<exif:YResolution>96</exif:YResolution>\n  
...

My previous experience was using the package scipy and related functions like 'imread' ... and the input is usually a filename. Now I really got lost what does those unicode mean and what I can do to transform it into a format that I am familiar with.

Can anyone share with me how can I read in those unicode into a scipy image (ndarray)?

813

asked Oct 15 '15 01:10

B.Mr.W.

1 Answers

Your data looks like the raw bytes from a real image file (JPG?). The problem with your data is that it should be bytes, not unicode. You have to figure out how to convert from unicode to bytes. There is a whole can of worms full of encoding traps you have to deal with, but you may be lucky using img.encode('iso-8859-1'). I don't know and I will not deal with that in my answer.

The raw data for a PNG image looks like this:

rawdata = '\x89PNG\r\n\x1a\n\x00\x00...\x00\x00IEND\xaeB`\x82'

Once you have it in bytes, you can create a PIL image from the raw data, and read it as a nparray:

>>> from StringIO import StringIO
>>> from PIL import Image
>>> import numpy as np
>>> np.asarray(Image.open(StringIO(rawdata)))

array([[[255, 255, 255,   0],
    [255, 255, 255,   0],
    [255, 255, 255,   0],
    ...,
    [255, 255, 255,   0],
    [255, 255, 255,   0],
    [255, 255, 255,   0]]], dtype=uint8)

All you need to make it work on Spark is SparkContext.binaryFiles:

>>> images = sc.binaryFiles("path/to/images/")
>>> image_to_array = lambda rawdata: np.asarray(Image.open(StringIO(rawdata)))
>>> images.values().map(image_to_array)

175

answered Sep 22 '22 21:09

Paulo Scardine

Related questions
                            
                                Column with list of strings in python
                            
                                SQLAlchemy query for object with count of relationship
                            
                                Send method using generator. still trying to understand the send method and quirky behaviour
                            
                                Python - Any way to avoid several if-statements inside each other in a for-loop?
                            
                                Django, RabbitMQ, & Celery - why does Celery run old versions of my tasks after I update my Django code in development?
                            
                                Python requests, how to limit received size, transfer rate, and/or total time?
                            
                                Why python's list slicing doesn't produce index out of bound error? [duplicate]
                            
                                How to resize an HDF5 array with `h5py`
                            
                                Requests in Asyncio - Keyword Arguments
                            
                                Can't install Python-MySQL on OS X 10.10 Yosemite
                            
                                Why does a class definition always produce the same bytecode?
                            
                                Python cannot allocate memory using multiprocessing.pool
                            
                                How do you interpolate from an array containing datetime objects?
                            
                                Backup Odoo db from within odoo
                            
                                Why is it recommended to derive from Exception instead of BaseException class in Python?
                            
                                OrderedDict: are values ordered, too? [duplicate]
                            
                                How to map a series of conditions as keys in a dictionary?
                            
                                'numpy.ndarray' object has no attribute 'remove'
                            
                                Dictionary comprehension with inline functions
                            
                                How to print function arguments in sys.settrace?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With