 

Hadoop: how to access (many) photo images to be processed by map/reduce?

I have 10M+ photos saved on the local file system. Now I want to go through each of them and analyze the binary of the photo to see if it's a dog. I basically want to do the analysis on a clustered Hadoop environment. The problem is: how should I design the input for the map method? Let's say, in the map method, new FaceDetection(photoInputStream).isDog() is all the underlying logic for the analysis.

Specifically, should I upload all of the photos to HDFS? Assuming yes:

  1. how can I use them in the map method?

  2. Is it OK to make the input (to the map) a text file containing all of the photo paths (in HDFS), one per line, and in the map method load the binary like photoInputStream = getImageFromHDFS(photopath)? (Actually, what is the right way to load a file from HDFS during the execution of the map method?)

It seems I'm missing some knowledge about the basic principles of Hadoop, MapReduce and HDFS, but can you please point me in the right direction on the above questions? Thanks!

asked Jan 06 '12 by leslie


People also ask

What is MapReduce in Hadoop?

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application.

How does MapReduce enable query processing quickly in big data problems?

MapReduce is Hadoop's native batch processing engine. A MapReduce job splits a large dataset into independent chunks and organizes them into key-value pairs for parallel processing. The mapping and reducing functions receive not just values, but (key, value) pairs.

When would you use MapReduce with big data?

MapReduce is suitable for iterative computation involving large quantities of data requiring parallel processing. It represents a data flow rather than a procedure. It's also suitable for large-scale graph analysis; in fact, MapReduce was originally developed for determining PageRank of web documents.

Is MapReduce and Hadoop same?

Apache Hadoop is an ecosystem that provides a reliable, scalable environment for distributed computing. MapReduce is a submodule of this project: a programming model used to process huge datasets that sit on HDFS (the Hadoop Distributed File System).


2 Answers

how can I use them in the map method?

The major problem is that each image is its own file, and each file will get its own mapper. So if you have 10M files, you'll have 10M mappers, which doesn't sound terribly reasonable. You may want to consider pre-serializing the files into SequenceFiles (one image per key-value pair). This will make loading the data into the MapReduce job native, so you don't have to write any tricky code. Also, you'll be able to store all of your data in one SequenceFile, if you so desire. Hadoop handles splitting SequenceFiles quite well.
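To make the "use them in the map method" part concrete, here is a minimal sketch (not from the original answer) of a mapper consuming such a SequenceFile via SequenceFileInputFormat, assuming the keys are file names (Text) and the values are raw image bytes (BytesWritable); FaceDetection/isDog() is the asker's hypothetical detection logic:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DogDetectMapper extends Mapper<Text, BytesWritable, Text, Text> {
        @Override
        protected void map(Text fileName, BytesWritable imageBytes, Context context)
                throws IOException, InterruptedException {
            // BytesWritable's backing array can be longer than the payload,
            // so trim it to the actual length before decoding the image.
            byte[] raw = Arrays.copyOf(imageBytes.getBytes(), imageBytes.getLength());

            // FaceDetection/isDog() is the asker's hypothetical detection logic.
            if (new FaceDetection(new ByteArrayInputStream(raw)).isDog()) {
                context.write(fileName, new Text("dog"));
            }
        }
    }
    // Driver side (sketch): job.setInputFormatClass(SequenceFileInputFormat.class);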

As for producing the SequenceFile in the first place: the way this works is, you have a separate Java process that takes several image files, reads the raw bytes into memory, then stores the data as key-value pairs in a SequenceFile. Keep going, and keep writing into HDFS. This may take a while, but you'll only have to do it once.
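A minimal sketch of such a conversion process (assumptions: the Hadoop 2.x SequenceFile.Writer option API; the local directory and HDFS output path are hypothetical; older Hadoop versions would use the FileSystem-based SequenceFile.createWriter overloads instead):

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ImagesToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical HDFS output path; fs.defaultFS should point at the cluster.
            Path output = new Path("/user/photos/photos.seq");

            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            try {
                // Hypothetical local directory holding the photos.
                for (File photo : new File("/data/photos").listFiles()) {
                    byte[] raw = Files.readAllBytes(photo.toPath());
                    writer.append(new Text(photo.getName()), new BytesWritable(raw));
                }
            } finally {
                writer.close();
            }
        }
    }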


Is it OK to make the input (to the map) a text file containing all of the photo paths (in HDFS), one per line, and in the map method load the binary like photoInputStream = getImageFromHDFS(photopath)? (Actually, what is the right way to load a file from HDFS during the execution of the map method?)

This is not OK if you have any sort of reasonable cluster (which you should, if you are considering Hadoop for this) and you actually want to use the power of Hadoop. Your MapReduce job will fire off and load the files, but the mappers will be running data-local to the text file of paths, not to the images! So, basically, you are going to be shuffling the image files everywhere, since the JobTracker is not placing tasks where the files are. This will incur a significant amount of network overhead. If you have 1 TB of images, you can expect that a lot of them will be streamed over the network if you have more than a few nodes. This may not be so bad depending on your situation and cluster size (fewer than a handful of nodes).

If you do want to do this, you can use the FileSystem API to read the files (you want the open method).
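For completeness, a minimal sketch of that pattern (hypothetical mapper; FaceDetection/isDog() is the asker's class): each input line from the path-list text file is an HDFS path, opened inside map() with FileSystem.open().

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PathListMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text photoPath, Context context)
                throws IOException, InterruptedException {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            FSDataInputStream photoInputStream = fs.open(new Path(photoPath.toString()));
            try {
                // FaceDetection/isDog() is the asker's hypothetical detection logic.
                if (new FaceDetection(photoInputStream).isDog()) {
                    context.write(photoPath, new Text("dog"));
                }
            } finally {
                photoInputStream.close();
            }
        }
    }

Note that this is exactly the pattern the paragraph above warns about: every open() call pulls image bytes over the network to wherever the mapper happens to run.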

answered Oct 17 '22 by Donald Miner


I have 10M+ photos saved on the local file system.

Assuming it takes a second to put each file into the sequence file, converting the individual files will take ~115 days (10M seconds / 86,400 seconds per day ≈ 115 days). Even with parallel processing on a single machine, I don't see much improvement, because disk read/write will be the bottleneck while reading the photo files and writing the sequence file. Check this Cloudera article on the small files problem; it also references a script which converts a tar file into a sequence file, and reports how long the conversion took.

Basically, the photos have to be processed in a distributed way to convert them into sequence files. Back to Hadoop :)

According to Hadoop: The Definitive Guide:

As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory.

So, directly loading 10M files (each file plus its block, at ~150 bytes per object) will require around 3,000 MB of NameNode memory just for storing the namespace. And that is before you even consider streaming the photos across nodes during the execution of the job.

There should be a better way of solving this problem.


Another approach is to load the files as-is into HDFS and use CombineFileInputFormat, which combines many small files into a single input split and considers data locality while calculating the splits. The advantage of this approach is that the files can be loaded into HDFS as-is, without any conversion, and there is also not much data shuffling across nodes.
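A minimal sketch of that setup (assumptions: the new mapreduce API; WholeFileRecordReader is a hypothetical per-file reader, not shown, that emits one (file name, raw bytes) pair per image):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

    public class CombinedImageInputFormat extends CombineFileInputFormat<Text, BytesWritable> {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split an individual image across splits
        }

        @Override
        public RecordReader<Text, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException {
            // CombineFileRecordReader walks the files packed into one combined split
            // and delegates each file to the per-file reader.
            return new CombineFileRecordReader<Text, BytesWritable>(
                    (CombineFileSplit) split, context, WholeFileRecordReader.class);
        }
    }
    // Driver side (sketch): job.setInputFormatClass(CombinedImageInputFormat.class);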

answered Oct 17 '22 by Praveen Sripati