
HDFS: Using HDFS API to append to a SequenceFile




I've been trying to create and maintain a SequenceFile on HDFS using the Java API, without running a MapReduce job, as setup for a future MapReduce job. I want to store all of the input data for that job in a single SequenceFile, but the data arrives incrementally throughout the day. The problem is that if the SequenceFile already exists, the following call overwrites it instead of appending to it.

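// fs and conf are set up for HDFS, not as a LocalFileSystem
// This overload always creates the file, so an existing SequenceFile is overwritten
SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf, new Path(hdfsPath),
               keyClass, valueClass, SequenceFile.CompressionType.NONE);
seqWriter.append(new Text(key), new BytesWritable(value));
seqWriter.close();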

Another constraint is that I cannot keep the data in a file format of my own and convert it into a SequenceFile at the end of the day, because a MapReduce job could be launched on that data at any point.

I cannot find any other API call to append to a SequenceFile and maintain its format. I also cannot simply concatenate two SequenceFiles because of their formatting needs.

I also want to avoid running a MapReduce job for this, since it has high overhead for the small amount of data I'm adding to the SequenceFile each time.

Any thoughts or work-arounds? Thanks.

Matt D asked Dec 27 '22 22:12


1 Answer

Support for appending to an existing SequenceFile was added in Apache Hadoop releases 2.6.1 and 2.7.2 onwards, via the enhancement JIRA HADOOP-7139: https://issues.apache.org/jira/browse/HADOOP-7139

For example usage, see the test case: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileAppend.java#L63-L140

CDH5 users can find the same capability in CDH 5.7.1 onwards.
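
As a rough illustration of the append support described above, here is a minimal sketch that assumes a Hadoop release containing HADOOP-7139 and uses the option-based `SequenceFile.createWriter` with `SequenceFile.Writer.appendIfExists(true)`. The class name `SequenceFileAppendSketch` and the helpers `appendRecord` and `countRecords` are hypothetical names made up for this example; the `Text`/`BytesWritable` key and value types are taken from the question. Treat it as a starting point rather than a definitive implementation.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical helper class, not part of Hadoop.
public class SequenceFileAppendSketch {

    // Appends one record to the SequenceFile at 'path', creating the file if it
    // does not exist yet. Assumes the underlying FileSystem supports append and
    // that an existing file was written with the same key/value classes and
    // compression settings.
    public static void appendRecord(Configuration conf, Path path, String key, byte[] value)
            throws IOException {
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE),
                SequenceFile.Writer.appendIfExists(true))) {
            writer.append(new Text(key), new BytesWritable(value));
        }
    }

    // Reads the file back and counts its records, e.g. to check an append.
    public static int countRecords(Configuration conf, Path path) throws IOException {
        int count = 0;
        try (SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                count++;
            }
        }
        return count;
    }
}
```

With `conf` pointing at the cluster, calling `appendRecord(conf, new Path(hdfsPath), key, value)` repeatedly should grow a single file instead of replacing it. Opening and closing a writer per record is only for brevity here; a long-lived writer that batches records would likely be cheaper.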

2 revs answered Jan 05 '23 16:01
