 

Merging CSV files in Hadoop [closed]

I am new to the Hadoop framework and would really appreciate it if someone could walk me through this.

I am trying to merge two .csv files.

The two files have the same headers and are ordered the same, etc.

The thing is that I have no idea how to merge these files into one and then remove the empty lines and unused columns.

Suzanne asked Sep 06 '25


1 Answer

The two files have the same headers and are ordered the same, etc.

Since the files are the same, you can upload them to the same directory.

hdfs dfs -mkdir -p /path/to/input
hdfs dfs -put file1.csv /path/to/input
hdfs dfs -put file2.csv /path/to/input

HDFS itself just stores them as separate files, but Hadoop processing tools (MapReduce, Spark, Hive, etc.) will treat all files in that directory as a single dataset if you read from hdfs:///path/to/input.

Note: you'll want to strip the header row from both files before placing them into HDFS in this fashion.
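For example (assuming the filenames above), the header can be stripped with tail, which prints everything from line 2 onward:

```shell
# Remove line 1 (the header) from each file before uploading.
tail -n +2 file1.csv > file1-noheader.csv
tail -n +2 file2.csv > file2-noheader.csv

hdfs dfs -put file1-noheader.csv /path/to/input
hdfs dfs -put file2-noheader.csv /path/to/input
```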

Another option would be to concatenate the files locally. (Again, remove the headers first, or at least from all but the first file)

cat file1.csv file2.csv > file3.csv
hdfs dfs -put file3.csv /path/to/input
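If you'd rather keep a single header row in the merged file, one sketch (again assuming the filenames above) is to copy the first file whole and append the second without its header:

```shell
# file1.csv keeps its header; file2.csv is appended from line 2 onward.
cat file1.csv > file3.csv
tail -n +2 file2.csv >> file3.csv

hdfs dfs -put file3.csv /path/to/input
```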

After that, use whatever Hadoop tools you know to read the files.
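As for cleaning the empty lines and unused columns, that can also be done locally before the upload. A minimal awk sketch (the column numbers 1 and 3 are just a hypothetical example; substitute whichever columns you actually need):

```shell
# NF > 0 skips blank lines; print keeps only the wanted columns.
awk -F',' 'NF > 0 { print $1 "," $3 }' file3.csv > cleaned.csv
```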

OneCricketeer answered Sep 07 '25