I am using a mapper that converts binary files (JPEGs) to a Hadoop SequenceFile (HSF):
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String uri = value.toString().replace(" ", "%20");
    Configuration conf = new Configuration();
    FSDataInputStream in = null;
    try {
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        in = fs.open(new Path(uri));
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024 * 1024];
        int bytesRead;
        // Only write the bytes actually read; writing the whole buffer
        // on every pass would pad the image with stale data.
        while ((bytesRead = in.read(buffer)) != -1) {
            bout.write(buffer, 0, bytesRead);
        }
        context.write(value, new BytesWritable(bout.toByteArray()));
    } finally {
        if (in != null) {
            in.close();
        }
    }
}
I then have a second mapper that reads the HSF, thus:
public class ImagePHashMapper extends Mapper<Text, BytesWritable, Text, Text> {

    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // Get the pHash for this specific file. BytesWritable.getBytes()
        // returns the backing array, which can be longer than the valid
        // data, so trim it to getLength() bytes first.
        String pHashStr;
        try {
            pHashStr = calculatePhash(Arrays.copyOf(value.getBytes(), value.getLength()));
and calculatePhash is:
static String calculatePhash(byte[] imageData) throws NoSuchAlgorithmException {
    // Get the pHash for this specific data.
    // ImagePHash requires an InputStream rather than a byte array.
    InputStream is = new ByteArrayInputStream(imageData);
    String ph;
    try {
        ImagePHash ih = new ImagePHash();
        ph = ih.getHash(is);
        System.out.println("phash: " + ph);
    } catch (Exception e) {
        e.printStackTrace();
        return "Internal error with ImagePHash.getHash";
    }
    return ph;
}
This all works fine, but I want calculatePhash to write out each JPEG's last modified date. I know I can use file.lastModified()
to get the last modified date of a file, but is there any way to get this in either map or calculatePhash? I'm a noob at Java. TIA!
To read a SequenceFile using the Java API in Hadoop, create an instance of SequenceFile.Reader. Using that reader you can iterate over the (key, value) pairs in the SequenceFile with the next() method, as in the sketch below.
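A minimal sketch, assuming Hadoop 2.x or later and a SequenceFile with Text keys and BytesWritable values (as produced by the mapper above); the file path is a placeholder:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileReadDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Hypothetical path to the SequenceFile written by the first mapper.
        Path path = new Path("/data/images.seq");
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            // next() fills key and value, and returns false at end of file.
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value.getLength() + " bytes");
            }
        } finally {
            if (reader != null) {
                reader.close();
            }
        }
    }
}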
A SequenceFile is a flat, binary file type that serves as a container for data to be used in Apache Hadoop distributed computing projects. SequenceFiles are used extensively with MapReduce.
You cannot modify data once it is stored in HDFS, because HDFS follows a write-once, read-many model; you can only append to a file once it is stored in HDFS.
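For illustration, a hedged sketch of appending to an existing HDFS file; the path is a placeholder, append must be enabled on the cluster, and note that appending raw bytes to a SequenceFile would corrupt it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Hypothetical plain-text file for illustration only.
FSDataOutputStream out = fs.append(new Path("/data/log.txt"));
try {
    out.write("one more line\n".getBytes("UTF-8"));
} finally {
    out.close();
}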
In Hadoop, each file is made up of blocks. The Hadoop FileSystem API lives in the org.apache.hadoop.fs package; if your input files are in HDFS, you need to import that package and can then read the file's metadata:
FileSystem fs = FileSystem.get(URI.create(uri), conf);
in = fs.open(new Path(uri));
// FileStatus carries the HDFS metadata, including the modification
// time in milliseconds since the epoch.
FileStatus fileStatus = fs.getFileStatus(new Path(uri));
long modificationTime = fileStatus.getModificationTime();
Date date = new Date(modificationTime);
SimpleDateFormat df2 = new SimpleDateFormat("dd/MM/yy HH:mm:ss");
String dateText = df2.format(date);
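Since the second mapper only ever sees the SequenceFile, not the original JPEGs, the natural place to capture the date is the first mapper, which still knows each file's URI. A sketch, assuming you are willing to pack the date into the Text key alongside the URI (the tab separator is an arbitrary choice):

// Inside the first mapper's map(), after opening the file:
FileStatus fileStatus = fs.getFileStatus(new Path(uri));
String dateText = new SimpleDateFormat("dd/MM/yy HH:mm:ss")
        .format(new Date(fileStatus.getModificationTime()));
// Emit "uri<TAB>date" as the key so the second mapper can split it out:
context.write(new Text(value.toString() + "\t" + dateText),
        new BytesWritable(bout.toByteArray()));

In the second mapper, key.toString().split("\t") then recovers the URI and the date, which you can pass into calculatePhash or write straight to the output.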
I hope this will help you.