
How do I get last modified date from a Hadoop Sequence File?

I am using a mapper that converts binary files (JPEGs) to a Hadoop Sequence File (HSF):

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        String uri = value.toString().replace(" ", "%20");
        Configuration conf = new Configuration();

        FSDataInputStream in = null;
        try {
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            in = fs.open(new Path(uri));
            ByteArrayOutputStream bout = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024 * 1024];

            // Write only the bytes actually read; writing the whole buffer
            // each time would pad the image with stale data.
            int bytesRead;
            while ((bytesRead = in.read(buffer, 0, buffer.length)) >= 0) {
                bout.write(buffer, 0, bytesRead);
            }
            context.write(value, new BytesWritable(bout.toByteArray()));

I then have a second mapper that reads the HSF, thus:

    public class ImagePHashMapper extends Mapper<Text, BytesWritable, Text, Text> {

        public void map(Text key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            // get the PHash for this specific file
            String PHashStr;
            try {
                PHashStr = calculatePhash(value.getBytes());

and calculatePhash is:

    static String calculatePhash(byte[] imageData) throws NoSuchAlgorithmException {
        // get the PHash for this specific data
        // PHash requires an InputStream rather than a byte array
        InputStream is = new ByteArrayInputStream(imageData);
        String ph;
        try {
            ImagePHash ih = new ImagePHash();
            ph = ih.getHash(is);
            System.out.println("file: " + is.toString() + " phash: " + ph);
        } catch (Exception e) {
            e.printStackTrace();
            return "Internal error with ImagePHash.getHash";
        }

        return ph;
    }

This all works fine, but I want calculatePhash to write out each JPEG's last modified date. I know I can use file.lastModified() to get the last modified date of a file, but is there any way to get this in either map or calculatePhash? I'm a noob at Java. TIA!

asked Nov 14 '14 at 18:11 by schoon


1 Answer

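In Hadoop, each file is made up of blocks. The Hadoop FileSystem classes are in the package org.apache.hadoop.fs. If your input files are in HDFS, import that package and ask the FileSystem for the file's FileStatus, which carries the modification time. Do this in the first mapper, where the original path (uri) is still known; the sequence file stores only the key and the bytes you wrote, not the source file's timestamp: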

    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    in = fs.open(new Path(uri));

    // FileStatus holds the file's metadata, including its modification time
    org.apache.hadoop.fs.FileStatus fileStatus = fs.getFileStatus(new Path(uri));
    // milliseconds since January 1, 1970 UTC
    long modificationDate = fileStatus.getModificationTime();

    // needs java.util.Date and java.text.SimpleDateFormat
    Date date = new Date(modificationDate);
    SimpleDateFormat df2 = new SimpleDateFormat("dd/MM/yy HH:mm:ss");
    String dateText = df2.format(date);
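
To make that timestamp visible to the second mapper or to calculatePhash, one option is to carry it along inside the sequence file key. The following is only a sketch of that idea in plain Java: the class ModTimeKey, its method names, the tab separator, the sample path, and the UTC time zone are all assumptions for illustration, not part of Hadoop or of the code above.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper: packs a file URI and its modification time into one key string.
final class ModTimeKey {
    // Assumed separator; a tab should not occur in the %20-escaped URIs used above.
    private static final char SEP = '\t';

    // First mapper: build the key from the URI and FileStatus.getModificationTime().
    static String pack(String uri, long modificationMillis) {
        return uri + SEP + modificationMillis;
    }

    // Second mapper: recover the original URI from the packed key.
    static String uriOf(String packedKey) {
        int i = packedKey.lastIndexOf(SEP);
        return i < 0 ? packedKey : packedKey.substring(0, i);
    }

    // Second mapper: recover the modification time in milliseconds since the epoch.
    static long modificationMillisOf(String packedKey) {
        int i = packedKey.lastIndexOf(SEP);
        if (i < 0) {
            throw new IllegalArgumentException("key carries no modification time: " + packedKey);
        }
        return Long.parseLong(packedKey.substring(i + 1));
    }

    // Same pattern as above, pinned to UTC (an assumption) so output is the same on every node.
    static String format(long millis) {
        SimpleDateFormat df = new SimpleDateFormat("dd/MM/yy HH:mm:ss");
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        return df.format(new Date(millis));
    }

    public static void main(String[] args) {
        // Hypothetical path and timestamp, standing in for uri and getModificationTime().
        String key = pack("hdfs://namenode/images/cat%20photo.jpg", 1415988660000L);
        System.out.println(uriOf(key) + " last modified " + format(modificationMillisOf(key)));
    }
}
```

In the first mapper this would mean writing new Text(ModTimeKey.pack(uri, fileStatus.getModificationTime())) as the key instead of value; in the second mapper, ModTimeKey.modificationMillisOf(key.toString()) gives the long back, which can then be passed to calculatePhash as an extra argument.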

I hope this will help you.

answered Sep 20 '22 at 15:09 by ǨÅVËĔŊ RĀǞĴĄŅ