I'm struggling with a basic awk command.
File 1:
AB253828.1
AB253829.1
AB253830.1
AB253831.1
File 2:
accession accession.version taxid gi
A00001 A00001.1 10641 58418
A00002 A00002.1 9913 2
A00003 A00003.1 9913 3
A00004 A00004.1 32630 57971
A00005 A00005.1 32630 57972
A00006 A00006.1 32630 57973
A00008 A00008.1 32630 57974
A00009 A00009.1 32630 57975
A00010 A00010.1 32630 57976
Both files have more than 1,000,000 lines.
I would like to print columns 2 and 3 of file 2 if column 2 matches one of the patterns in file 1. I tried a lot of possibilities, but none work...
for ACC in $(cat file1.txt)
do
#ACC1=$(echo "\"$ACC\"")
awk -v OFS='\t'-v z="$ACC" '{ if($2 == z) { print $2,$3 } }' file2.txt
done
I got
awk: cannot open { if($2 == z) { print $2,$3 } } file2.txt (No such file or directory)
I checked; file2 is there. I suppose my problem is the variable z, but I can't find the solution.
The immediate problem is that you are missing a space before the second -v option. (Look closely: you are setting OFS to \t-v, so Awk takes z="$ACC" as your actual Awk script, and then tries to open your intended script text as an input file; hence the "cannot open ... (No such file or directory)" error.) But really, you want to overhaul this more thoroughly.
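For completeness, here is a minimal sketch of the loop with just the missing space restored (and the fragile `for ACC in $(cat ...)` replaced by a `while read` loop); it works, but it still rescans file2.txt once per ID, so it stays slow:

```shell
# Minimal fix of the original loop: note the space between the two -v options.
# Still O(lines of file1 * lines of file2); kept only to illustrate the bug.
while IFS= read -r ACC; do
  awk -v OFS='\t' -v z="$ACC" '$2 == z { print $2, $3 }' file2.txt
done < file1.txt
```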
awk -v OFS='\t' 'NR==FNR { z[$1]++; next }
$2 in z { print $2,$3 }' file1.txt file2.txt
This uses a common Awk idiom: read the first file into memory, then print the records from the second file whose second field exists as a key from the first. This should be orders of magnitude faster, and it also trivially fixes the reading-lines-with-for antipattern.
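As a quick sanity check, suppose (hypothetically) that your ID file contained a line that actually matches the sample file2.txt above; here ids.txt stands in for file1.txt:

```shell
# Hypothetical demo: ids.txt plays the role of file1.txt, with one matching ID.
printf 'A00002.1\n' > ids.txt
awk -v OFS='\t' 'NR==FNR { z[$1]++; next }
  $2 in z { print $2,$3 }' ids.txt file2.txt
# prints (tab-separated): A00002.1  9913
```

While NR==FNR (i.e. while reading the first file), each ID is stored as an array key; from the second file onward, `$2 in z` is a constant-time hash lookup.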
If the first file is too large to fit into memory at once, maybe partition it into smaller pieces (say 500,000 lines each?) and run this on each of those separately. It should be easy to see when Awk consumes so much memory that your system starts thrashing; at least during the first few runs, keep an eye on top or some similar monitoring tool and kill the process if it misbehaves.
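A possible sketch of that partitioning, using the standard `split` utility (the chunk prefix and output filename are arbitrary choices, not anything special):

```shell
# Split file1.txt into 500,000-line pieces named chunk_aa, chunk_ab, ...
split -l 500000 file1.txt chunk_
# Run the lookup once per piece; each file2 row matches in at most one piece,
# so concatenating the per-chunk results gives the full answer.
for part in chunk_*; do
  awk -v OFS='\t' 'NR==FNR { z[$1]++; next }
    $2 in z { print $2,$3 }' "$part" file2.txt
done > matches.txt
rm -f chunk_*
```

Note that this rereads file2.txt once per chunk, and the output is grouped by chunk rather than in file2 order; that is the price of capping memory use.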