
Bash: How to keep lines in a file that have fields that match lines in another file?

I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.

file A is something like:

Name (tab)  #  (tab)  #  (tab)  KEYFIELD  (tab)  Other fields

For file B, I used cut, sed, and other tools to reduce it to a single field, so it is now a list with one entry per line.

So the goal is to keep every line in file A whose 4th field (marked KEYFIELD above) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, that would be OK.)

I tried to do:

grep -f fileBcutdown fileA > outputfile

EDIT: Ok I give up. I just force killed it.

Is there a better way to do this? File A is 13.7MB and file B after cutting it down is 32.6MB for anyone that cares.

EDIT: This is an example line in file A:

chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,

example line from file B cut down:

ENST00000111111

asked Sep 16 '25 23:09 by Joe


1 Answer

Here's one way using GNU awk. Run like:

awk -f script.awk fileB.txt fileA.txt

Contents of script.awk:

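# First file (fileB.txt): remember every ID, one per line.
FNR==NR {
    array[$0]++
    next
}

# Second file (fileA.txt): compare field 4 against the stored IDs.
{
    line = $4
    sub(/\.[0-9]+$/, "", line)    # drop a trailing version suffix such as ".1"
    if (line in array) {
        print
    }
}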
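
To see what the sub() call does on its own, here is a minimal sketch that feeds two IDs through it. The first ID is taken from the example line in the question and the second from the example for file B; the variable name id is just for illustration.

```shell
# Strip a trailing ".<digits>" version suffix from each ID, if present.
stripped=$(printf 'ENST00000449339.1\nENST00000111111\n' |
    awk '{ id = $1; sub(/\.[0-9]+$/, "", id); print id }')
echo "$stripped"
```

An ID without a suffix passes through unchanged, which is why the lookup in the array works for either form.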

Alternatively, here's the one-liner:

awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
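
As a quick sanity check, the one-liner can be tried on two tiny files first. This is only a sketch: the file names fileA_sample.txt and fileB_sample.txt, the second line of file A, and the second ID in file B are made up for illustration; the first line of file A is the example from the question.

```shell
# Hypothetical sample data in a scratch directory (names are illustrative only).
tmpdir=$(mktemp -d)

# File A: the example line from the question, plus one made-up line that should be dropped.
printf 'chr21\t33025905\t33031813\tENST00000449339.1\t0\t-\t33031813\t33031813\t0\t3\t1835,294,104,\t0,4341,5804,\n' > "$tmpdir/fileA_sample.txt"
printf 'chr21\t100\t200\tENST00000999999.2\t0\t+\t100\t200\t0\t1\t100,\t0,\n' >> "$tmpdir/fileA_sample.txt"

# File B: one ID per line, without the version suffix.
printf 'ENST00000111111\nENST00000449339\n' > "$tmpdir/fileB_sample.txt"

# Same one-liner as above; cut keeps only field 4 of each surviving line for inspection.
result=$(awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' \
    "$tmpdir/fileB_sample.txt" "$tmpdir/fileA_sample.txt" | cut -f4)
echo "$result"
```

Only the line whose 4th field is ENST00000449339.1 should survive, since its ID without the suffix is listed in the sample file B.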

GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.


UPDATE using files HumanGenCodeV12 and GenBasicV12:

Run like:

awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt

Contents of script.awk:

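# First file (HumanGenCodeV12): store field 12 with all non-alphanumeric characters removed.
FNR==NR {
    gsub(/[^[:alnum:]]/,"",$12)
    array[$12]++
    next
}

# Second file (GenBasicV12): print lines whose field 4, minus any version suffix, was stored.
{
    line = $4
    sub(/\.[0-9]+$/, "", line)
    if (line in array) {
        print
    }
}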

This prints the lines in GenBasicV12 whose 4th field (minus the version suffix) matches the cleaned-up 12th field of some line in HumanGenCodeV12. The output file (output.txt) contains 65340 lines, and the script takes less than 10 seconds to complete.
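
The scripts above require field 4 to equal an ID exactly once the version suffix is removed. If the looser substring match mentioned in the question is needed instead, one possible alternative (a different technique, not the awk approach above) is grep with -F, which treats each pattern line as a fixed string instead of a regular expression. This is only a sketch: the sample file names and contents are made up, and unlike the awk scripts it matches anywhere on the line, not only in field 4.

```shell
# Hypothetical sample data; the Blah / Blah_blah pair mirrors the question's example.
tmpdir2=$(mktemp -d)
printf 'Name1\t1\t2\tBlah_blah\tother\nName2\t3\t4\tSomething\tother\n' > "$tmpdir2/fileA_sample.txt"
printf 'Blah\nNope\n' > "$tmpdir2/fileB_sample.txt"

# -F: patterns are fixed strings, not regexes; -f: read one pattern per line from a file.
kept=$(grep -F -f "$tmpdir2/fileB_sample.txt" "$tmpdir2/fileA_sample.txt")
echo "$kept"
```

Note that an empty line in the pattern file matches every input line, so the list should be checked for blank lines before it is used this way.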

answered Sep 19 '25 16:09 by Steve