Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

merge/join two tables fast linux command line

Let us say I have two relatively large tab-delimited files file1.txt, file2.txt.

file1.txt
id\tcity\tcar\ttype\tmodel

file2.txt 
id\tname\trating

Let us suppose that file1.txt has 2000 unique ids, and therefore 2000 unique rows, and file2.txt has only 1000 unique rows, and therefore 1000 unique ids. Is there a way to merge the two tables?

Case 1. merge them by id in file1.txt, where when there is no id in file2.txt NAs would be filled in.

Case2. merge them by id in file2.txt, where when only the ids in file2.txt will be printed out with the fields in file1.txt and file2.txt.

Note: the merged new files should also be tab-delimited file, with a header file as well. Note2. I'd also appreciate suggestions on how to do it when there is no header as well.

Thanks!

like image 557
Dnaiel Avatar asked Dec 20 '22 13:12

Dnaiel


2 Answers

join -j 1 <(sort file1.txt) <(sort file2.txt)

Does your 'case 2' approach with only standard unix tools. Of course, if the files are sorted, you can drop the sort.

If you included the headers, you might rely on the ids being numerical for sorting the joined header to the top:

join -j 1 <(sort file1.txt) <(sort file2.txt) | sort -n

With

  • file1.txt

    id  city    car type    model
    1   york    subaru  impreza king
    2   kampala toyota  corolla sissy
    3   luzern  chrysler    gravity falcon
    
  • file2.txt

    id  name    rating
    3   zanzini PG
    2   tara    X
    
  • output:

    id  city    car type    model   name    rating
    2   kampala toyota  corolla sissy   tara    X
    3   luzern  chrysler    gravity falcon  zanzini PG
    

PS To preserve the TAB separator character, pass the -t option:

 join -t'    ' ...

It's kind of hard to show on SO that ' ' contained a TAB character. Type it with ^VTAB (e.g. in bash)

like image 89
sehe Avatar answered Jan 05 '23 04:01

sehe


This worked for me in case 1:

join -t $'\t' -1 1 -2 1 -a 1 -a 2 <(sort fileone.txt) <(sort filetwo.txt) | sort -n -t $'\t' > filethree.txt

then:

awk '{if(NF+0<7) printf "%s\tNA\tNA\n", $0; else print $0}' filethree.txt

like image 24
AWE Avatar answered Jan 05 '23 05:01

AWE