Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to join 3 files using awk?

Tags:

awk

Below are three files -

emp.txt
7839|BLAKE|PRESIDENT||17-NOV-81|5000||10
7698|KING|MANAGER|7839|01-MAY-81|2850||10
7782|CLARK|MANAGER|7839|09-JUN-81|2450||10
7566|JONES|MANAGER|7839|02-APR-81|2975||40
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50
7902|FORD|ANALYST|7566|03-DEC-81|3000||20


dept.txt
10|ACCOUNTING|NEW YORK
20|RESEARCH|DALLAS
30|SALES|CHICAGO
40|OPERATIONS|BOSTON

citycode.txt
1123|NEW YORK
1124|DALLAS
1125|CHICAGO
1126|BOSTON
1127|WASHINGTON

expected output : $0(from emp.txt where $6 >= 2850)+$2(from dept.txt)+$1(from citycode.txt)

7839|BLAKE|PRESIDENT||17-NOV-81|5000||10|ACCOUNTING|1123
7698|KING|MANAGER|7839|01-MAY-81|2850||10|ACCOUNTING|1123
7566|JONES|MANAGER|7839|02-APR-81|2975||40|OPERATIONS|1126
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50|NULL|NULL
7902|FORD|ANALYST|7566|03-DEC-81|3000||20|RESEARCH|1124

Explanation - join these files using common key of emp.txt($NF : outer join on emp.txt--record no 50 doesn't have common that is why it will print NULL) and dept.txt($1) where emp($6 >= 2850) and use common column of dept.txt($NF) and citycode.txt($1). joining two files is an easy task using common key but how can we join three files together.

like image 583
VIPIN KUMAR Avatar asked Feb 06 '23 00:02

VIPIN KUMAR


2 Answers

Use the built-in FILENAME variable to build the appropriate associative arrays when the city and department codes are being processed.

Then when processing the employees, output the desired data:

BEGIN {FS=OFS="|"}
function chknull(s) { return s ? s : "NULL"}
FILENAME == "citycode.txt" {citycode[$2]=$1}
FILENAME == "dept.txt" {depname[$1]=$2; depcity[$1]=$3}
FILENAME == "emp.txt" && $6 >= 2850 {
    print $0, chknull(depname[$NF]), chknull(citycode[depcity[$NF]])
}


$ awk -f j3.awk citycode.txt dept.txt emp.txt 
7839|BLAKE|PRESIDENT||17-NOV-81|5000||10|ACCOUNTING|1123
7698|KING|MANAGER|7839|01-MAY-81|2850||10|ACCOUNTING|1123
7566|JONES|MANAGER|7839|02-APR-81|2975||40|OPERATIONS|1126
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50|NULL|NULL
7902|FORD|ANALYST|7566|03-DEC-81|3000||20|RESEARCH|1124

(Note that I also added the chknull function to print "NULL" instead of empty string to match the desired output from the question, but that's not important for the meat of the question which is how to do joins with more than two files.)

like image 67
jas Avatar answered May 15 '23 05:05

jas


awk 'BEGIN{ FS=OFS="|" }
     FNR==1{ f++ }
     f==1{ d[$1]=$2; c[$NF]=$1; next }
     f==2 && $NF in c{ c[c[$NF]]=$1; next }
     $6 >=2850{
       print $0, ($NF in d ? d[$NF] : "NULL" ),( $NF in c ? c[$NF] : "NULL")
   }' dept.txt citycode.txt emp.txt

Explanation

awk '                               # Call awk
      BEGIN{          
              FS=OFS="|"            # Set input and output separator
      }
      FNR==1{                       # FNR will be 1 for each file when awk reads first line
            f++                     # File counter
      }

      # use can also do FILENAME == "dept.txt" here but f==1 is easy

      f==1{                         # for first file dept.txt
            d[$1]=$2;               # set array d where index is $1 and value is $2 of file dept.txt 
            c[$NF]=$1;              # set array c where index is last field and value is first field of file dept.txt
            next                    # stop processing go to next line
      }

      # when f equal 2 that is when we read second file citycode.txt 
      # and last field of current file exists in array c
      # there is no point in reading extra line exists in citycode.txt
      # we just need whichever city exists in dept.txt
      # so $NF in c

      f==2 && $NF in c{             

           # modify array c by empid 
           # that is index will be the value of array c 
           # corresponding to last field of current file and 
           # array c value will be citycode
           # Its because to make it easy to access array c value by empid while
           # reading emp.txt file

            c[c[$NF]]=$1;           

           # gawk user can delete array element here like below
           # if you have gawk uncomment below line
           # delete c[$NF]

            next                   # stop processing go to next line
      }

      $6 >=2850{                   # here we read emp.txt if 6th field is greater than or equal to 2850

            # Print current record/row/line from emp.txt

            # if last field of current file that is 
            # empid exists in array d then print department else NULL

            # if last field of current file that is
            # empid exists in array c then print citycode else NULL

            print $0,($NF in d?d[$NF]:"NULL"),($NF in c?c[$NF]:"NULL")
      }
   ' dept.txt citycode.txt emp.txt

Input

$ cat emp.txt 
7839|BLAKE|PRESIDENT||17-NOV-81|5000||10
7698|KING|MANAGER|7839|01-MAY-81|2850||10
7782|CLARK|MANAGER|7839|09-JUN-81|2450||10
7566|JONES|MANAGER|7839|02-APR-81|2975||40
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50
7902|FORD|ANALYST|7566|03-DEC-81|3000||20

$ cat dept.txt 
10|ACCOUNTING|NEW YORK
20|RESEARCH|DALLAS
30|SALES|CHICAGO
40|OPERATIONS|BOSTON

$ cat citycode.txt 
1123|NEW YORK
1124|DALLAS
1125|CHICAGO
1126|BOSTON
1127|WASHINGTON

Output

$ awk 'BEGIN{FS=OFS="|"}FNR==1{f++}f==1{d[$1]=$2;c[$NF]=$1;next}f==2 && $NF in c{c[c[$NF]]=$1;next}$6 >=2850{print $0,($NF in d?d[$NF]:"NULL"),($NF in c?c[$NF]:"NULL")}' dept.txt citycode.txt emp.txt
7839|BLAKE|PRESIDENT||17-NOV-81|5000||10|ACCOUNTING|1123
7698|KING|MANAGER|7839|01-MAY-81|2850||10|ACCOUNTING|1123
7566|JONES|MANAGER|7839|02-APR-81|2975||40|OPERATIONS|1126
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50|NULL|NULL
7902|FORD|ANALYST|7566|03-DEC-81|3000||20|RESEARCH|1124
like image 21
Akshay Hegde Avatar answered May 15 '23 04:05

Akshay Hegde