 

Merging multiple sequence files into one sequencefile within Hadoop

How can I merge multiple sequence files into one sequence file within Hadoop? Thanks.

asked Dec 07 '12 by cldo


People also ask

How do I combine multiple files into one in HDFS?

The Hadoop -getmerge command is used to merge multiple files in HDFS (Hadoop Distributed File System) and put them into one single output file in our local file system. We want to merge the two files present inside our HDFS, i.e. file1.txt and file2.txt, into a single output file.

How many formats are present in SequenceFile in Hadoop?

Essentially there are three different formats for SequenceFiles, depending on the CompressionType specified. All of them share a common header.

Are sequence files Splittable?

A sequence file stores data in rows as binary key/value pairs. The binary format makes it smaller than a text file. Sequence files are splittable.

What is sequential file in Hadoop?

A SequenceFile is a flat, binary file type that serves as a container for data to be used in Apache Hadoop distributed computing projects. SequenceFiles are used extensively with MapReduce.


2 Answers

If you want to merge multiple files into a single file, here are two answers:

Native shell commands


getmerge

Usage: hadoop fs -getmerge <src> <localdst>

Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file.
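As a concrete illustration of the usage line above (all paths here are hypothetical):

```shell
# Concatenate every file under the HDFS directory /user/hadoop/seqs
# into a single file on the local filesystem.
hadoop fs -getmerge /user/hadoop/seqs /tmp/merged.seq
```

Be aware that getmerge concatenates the files byte for byte; since every sequence file carries its own header, the concatenated output is generally not readable as one valid SequenceFile. A tool that merges at the record level (such as forqlift, mentioned in the other answer) avoids this.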



Java API


org.apache.hadoop.fs.FileUtil.copyMerge(FileSystem srcFS, Path srcDir, FileSystem dstFS, Path dstFile, boolean deleteSource, Configuration conf, String addString);

Copies all files in a directory to one output file (merge).

Copy to hdfs

put

Usage: hadoop dfs -put <localsrc> ... <dst>

Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads input from stdin and writes to destination filesystem.
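For example (paths hypothetical; newer Hadoop releases spell this hadoop fs -put):

```shell
# Copy two local files into an HDFS directory.
hadoop dfs -put file1.seq file2.seq /user/hadoop/input
# Read from stdin and write to a file in HDFS ("-" means stdin).
cat events.log | hadoop dfs -put - /user/hadoop/input/events.log
```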

copyFromLocal

Usage: hadoop dfs -copyFromLocal <localsrc> URI

Similar to the put command, except that the source is restricted to a local file reference.

answered Oct 21 '22 by saurabh shashank


Have you considered forqlift? I wrote it to handle certain SequenceFile chores, including SequenceFile merges.

In your case, you could run:

forqlift seq2seq --file new_combined_file.seq  \
    original_file1.seq  original_file2.seq original_file3.seq ...

Granted, forqlift's seq2seq tool is marked "experimental", but it has worked well in my (admittedly limited) internal testing.

answered Oct 21 '22 by qethanm