I am new to the Hadoop framework and would really appreciate it if someone could walk me through this.
I am trying to merge two .csv files.
The two files have the same headers and are ordered the same, etc.
The thing is that I have no idea how to merge these files into one and then clean the empty lines and unused columns.
Since the files have the same structure, you can upload them to the same directory.
hdfs dfs -mkdir -p /path/to/input
hdfs dfs -put file1.csv /path/to/input
hdfs dfs -put file2.csv /path/to/input
Most Hadoop tools will natively treat all files in that directory as parts of a single dataset if you read from hdfs:///path/to/input
Note: you'll want to strip the header from both files before placing them into HDFS in this fashion.
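For example, instead of the two -put commands above, you could drop each file's single header line with tail first (the _noheader names are just for illustration):
tail -n +2 file1.csv > file1_noheader.csv
tail -n +2 file2.csv > file2_noheader.csv
hdfs dfs -put file1_noheader.csv /path/to/input
hdfs dfs -put file2_noheader.csv /path/to/input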
Another option would be to concatenate the files locally. (Again, remove the headers first, or at least from all but the first file.)
cat file1.csv file2.csv > file3.csv
hdfs dfs -put file3.csv /path/to/input
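If you would rather keep a single header row, a small sketch (assuming each file has exactly one header line):
head -n 1 file1.csv > file3.csv
tail -n +2 file1.csv >> file3.csv
tail -n +2 file2.csv >> file3.csv
hdfs dfs -put file3.csv /path/to/input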
After that, use whatever Hadoop tools you know to read the files.
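For the cleanup itself, here is a minimal shell sketch that drops blank lines and keeps only the columns you need. The column numbers (1 and 3) and the output path are placeholders; adjust both for your data:
hdfs dfs -mkdir -p /path/to/output
# NF is 0 for a blank line, so 'NF' skips empty lines;
# $1 and $3 select the first and third comma-separated columns
hdfs dfs -cat /path/to/input/* \
  | awk -F',' 'NF { print $1 "," $3 }' \
  | hdfs dfs -put - /path/to/output/cleaned.csv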