
Using Unix/Bash, how can I make a lookup table?

Tags: bash, unix
So I have a .txt list of gene names and probe IDs, originalFile.txt, like so:

GENE_ID PROBE_ID
10111   19873
10112   284, 19983
10113   187

There are about 30,000 rows in this text file. I would like to create a new text file with no commas in the second column, like:

GENE_ID PROBE_ID
10111   19873
10112   284
10112   19983
10113   187

...but also, I want all of the PROBE_IDs to come from another text file, probes.txt, which looks like:

19873
284
187

...so that I can make a finalProduct.txt file that looks like:

GENE_ID PROBE_ID
10111   19873
10112   284
10113   187

If I wanted to type in each row of probes.txt by hand, I think I could achieve this result with something like:

awk -F"\t" '{for(i=1;i<=NF;i++){if ($i ~ /probeID#/){print $i}}}' myGenes > test.txt

But, of course, this wouldn't put the comma-separated probe IDs on different rows, and I would have to input each of the thousands of probeIDs by hand.

Does anyone have any hints or better suggestions? Thank you!

EDIT FOR CLARITY
To be clear, I'd like to take originalFile.txt and eventually produce finalProduct.txt, using probes.txt. Here are a few ways to think about it:

For each probe listed in probes.txt, find out whether it exists in originalFile.txt; if it does, print a line that has just the probe and the corresponding GENE_ID.

Or you could think of it as a kind of join, or a filter on originalFile.txt using probes.txt, where the PROBE_ID column of the output file contains only probes from probes.txt, each with its corresponding GENE_ID from originalFile.txt.

Or you could think of it as two steps: 1. make an intermediate file with one PROBE_ID per row, each paired with its GENE_ID; 2. remove all of the rows of that intermediate file where the PROBE_ID does not correspond to an entry in probes.txt.
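
The two-step view can be sketched roughly as follows. This is only an illustration under some assumptions: the columns are tab-separated, probes within a row are separated by a comma with optional spaces, and intermediate.txt is a made-up name for the intermediate file. The sample data is copied from the tables above and written into a scratch directory.

```shell
# Work in a scratch directory so existing files are not overwritten.
cd "$(mktemp -d)" || exit 1

# Sample inputs copied from the question (assumed: a tab between the columns).
printf 'GENE_ID\tPROBE_ID\n10111\t19873\n10112\t284, 19983\n10113\t187\n' > originalFile.txt
printf '19873\n284\n187\n' > probes.txt

# Step 1: one probe per row. Split the line on the tab, then split the
# probe column on commas with optional surrounding spaces.
awk -F'\t' -v OFS='\t' '
    NR == 1 { print; next }                  # pass the header through
    {
        n = split($2, ids, /[ ]*,[ ]*/)
        for (i = 1; i <= n; i++) print $1, ids[i]
    }
' originalFile.txt > intermediate.txt        # intermediate.txt: hypothetical name

# Step 2: keep the header plus the rows whose probe appears in probes.txt.
awk -F'\t' '
    NR == FNR { keep[$1]; next }             # first file: remember each probe
    FNR == 1 || ($2 in keep)                 # second file: header or a match
' probes.txt intermediate.txt > finalProduct.txt

cat finalProduct.txt
```

On the sample data, intermediate.txt should match the second table above and finalProduct.txt the last one. The NR == FNR idiom assumes probes.txt is not empty.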

EDIT 2
Currently trying to repurpose this - no result yet, but maybe link will be helpful.

asked May 18 '15 23:05 by K M



1 Answer

If probes.txt is small enough that it will fit in memory, you could try the following awk script:

BEGIN {
    OFS="\t";
    # split on tabs, commas and spaces: the given input has tabs between
    # gene and probes, and a comma plus a space between probes
    FS="[\t, ]+";
    # load probes into an array
    while ((getline probe < "probes.txt") > 0) {
        probes[probe] = 1;
    }
    close ("probes.txt");
}

# pass the header line through so the output keeps its column names
NR == 1 { print $1, $2; next }

{
    # print one gene/probe pair for each probe that is listed in probes.txt;
    # "in" tests membership without creating new array entries
    for (i=2; i <= NF; i++) {
        if ($i in probes) {
            print $1, $i;
        }
    }
}
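
One way to try this on the sample data from the question is to save the script to a file and pass it to awk with -f. The sketch below is only an illustration: lookup.awk is a hypothetical file name, the inlined body is a condensed variant of the script above (including a rule that passes the header line through), and the inputs are assumed to be tab-separated.

```shell
# Work in a scratch directory so existing files are not overwritten.
cd "$(mktemp -d)" || exit 1

# Sample inputs copied from the question (assumed: a tab between the columns).
printf 'GENE_ID\tPROBE_ID\n10111\t19873\n10112\t284, 19983\n10113\t187\n' > originalFile.txt
printf '19873\n284\n187\n' > probes.txt

# lookup.awk is a hypothetical name for the saved script.
cat > lookup.awk <<'EOF'
BEGIN {
    OFS = "\t"
    FS = "[\t, ]+"
    # load the wanted probes into an array
    while ((getline probe < "probes.txt") > 0) probes[probe] = 1
    close("probes.txt")
}
NR == 1 { print $1, $2; next }      # header line
{
    for (i = 2; i <= NF; i++)
        if ($i in probes) print $1, $i
}
EOF

awk -f lookup.awk originalFile.txt > finalProduct.txt
cat finalProduct.txt
```

The script reads probes.txt from the current directory, so it has to be run from the directory that contains that file.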
answered Oct 04 '22 01:10 by Diego