Let us say I have two relatively large tab-delimited files file1.txt, file2.txt.
file1.txt
id\tcity\tcar\ttype\tmodel
file2.txt
id\tname\trating
Let us suppose that file1.txt has 2000 unique ids, and therefore 2000 unique rows, and file2.txt has only 1000 unique rows, and therefore 1000 unique ids. Is there a way to merge the two tables?
Case 1. merge them by id in file1.txt, where when there is no id in file2.txt NAs would be filled in.
Case2. merge them by id in file2.txt, where when only the ids in file2.txt will be printed out with the fields in file1.txt and file2.txt.
Note: the merged new files should also be tab-delimited file, with a header file as well. Note2. I'd also appreciate suggestions on how to do it when there is no header as well.
Thanks!
join -j 1 <(sort file1.txt) <(sort file2.txt)
Does your 'case 2' approach with only standard unix tools. Of course, if the files are sorted, you can drop the sort.
If you included the headers, you might rely on the ids being numerical for sorting the joined header to the top:
join -j 1 <(sort file1.txt) <(sort file2.txt) | sort -n
With
file1.txt
id city car type model
1 york subaru impreza king
2 kampala toyota corolla sissy
3 luzern chrysler gravity falcon
file2.txt
id name rating
3 zanzini PG
2 tara X
output:
id city car type model name rating
2 kampala toyota corolla sissy tara X
3 luzern chrysler gravity falcon zanzini PG
PS To preserve the TAB separator character, pass the -t
option:
join -t' ' ...
It's kind of hard to show on SO that ' ' contained a TAB character. Type it with ^VTAB (e.g. in bash)
This worked for me in case 1:
join -t $'\t' -1 1 -2 1 -a 1 -a 2 <(sort fileone.txt) <(sort filetwo.txt) | sort -n -t $'\t' > filethree.txt
then:
awk '{if(NF+0<7) printf "%s\tNA\tNA\n", $0; else print $0}' filethree.txt
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With