Using Unix/Bash, how can I make a lookup table?

Tags: bash, unix

So I have a .txt list of gene names and probe IDs, originalFile.txt, like so:

GENE_ID PROBE_ID
10111   19873
10112   284, 19983
10113   187

There are about 30,000 rows in this text file. I would like to create a new text file with no commas in the second column, like:

GENE_ID PROBE_ID
10111   19873
10112   284
10112   19983
10113   187

...but also, I want all of the PROBE_IDs to come from another text file, probes.txt, which looks like:

19873
284
187

...so that I can make a finalProduct.txt file that looks like:

GENE_ID PROBE_ID
10111   19873
10112   284
10113   187

If I wanted to type in each row of probes.txt by hand, I think I could achieve this result with something like:

awk -F"\t" '{for(i=1;i<=NF;i++){if ($i ~ /probeID#/){print $i}}}' myGenes > test.txt

But, of course, this wouldn't put the comma-separated probe IDs on different rows, and I would have to input each of the thousands of probeIDs by hand.

Does anyone have any hints or better suggestions? Thank you!

EDIT FOR CLARITY
To restate what I'm asking: I'd like to take originalFile.txt and eventually produce finalProduct.txt, using probes.txt. You could think of it like this:

For each probe listed in probes.txt, find out whether it exists in originalFile.txt; if it does, print a line that has just the probe and the corresponding GENE_ID.

or you could think of it as some kind of filter on originalFile.txt using probes.txt, where the output file has the PROBE_ID column as the probes in probes.txt and the corresponding GENE_ID from originalFile.txt.

or you could think of it as:
1. make an intermediate file where there is a many-to-one correspondence between GENE_ID and PROBE_ID
2. remove all of the rows of that intermediate file where the PROBE_ID does not correspond to an entry in probes.txt

EDIT 2
Currently trying to repurpose this - no result yet, but maybe link will be helpful.

asked May 18 '15 23:05

K M

1 Answer

If probes.txt is small enough that it will fit in memory, you could try the following awk script:

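BEGIN {
    OFS="\t";
    # split on tabs, commas and spaces, because the given input has
    # tabs between gene and probes and spaces after the commas
    FS="[\t, ]+";
    # load probes into an array, one probe ID per line of probes.txt
    while ((getline probe < "probes.txt") > 0) {
        probes[probe] = 1;
    }
    close ("probes.txt");
}

# pass the header line (GENE_ID PROBE_ID) through to the output
NR == 1 {
    print $1, $2;
    next;
}

{
    # for each probe on the line, check if it's in the array
    # and skip it if not; "in" tests membership without
    # creating a new array entry for every probe that is looked up
    for (i=2; i <= NF; i++) {
        if ($i in probes) {
            print $1, $i;
        }
    }
}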
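
If you would rather not hard-code probes.txt inside BEGIN, a variant is to pass both files on the command line and use the common NR == FNR idiom to tell the first file from the second. This is only a sketch under a few assumptions: the sample data is copied from the question, lookup_demo is a made-up scratch directory so that no real files get overwritten, and the columns are assumed to be tab-separated with ", " between probes.

```shell
# Scratch directory (hypothetical name) holding the sample data from the question.
mkdir -p lookup_demo
printf 'GENE_ID\tPROBE_ID\n10111\t19873\n10112\t284, 19983\n10113\t187\n' > lookup_demo/originalFile.txt
printf '19873\n284\n187\n' > lookup_demo/probes.txt

awk '
BEGIN { OFS = "\t"; FS = "[\t, ]+" }
# First file (probes.txt): remember each probe ID, then move on.
NR == FNR { probes[$1]; next }
# First line of the second file: pass the header through.
FNR == 1 { print $1, $2; next }
# Data lines: one output row per probe that is listed in probes.txt.
{ for (i = 2; i <= NF; i++) if ($i in probes) print $1, $i }
' lookup_demo/probes.txt lookup_demo/originalFile.txt > lookup_demo/finalProduct.txt

cat lookup_demo/finalProduct.txt
```

Note that the NR == FNR test assumes probes.txt is not empty; with an empty first file, the lines of originalFile.txt would be treated as probes.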

answered Oct 04 '22 01:10

Diego
