So I have a .txt list of gene names and probe IDs, originalFile.txt, like so:
GENE_ID PROBE_ID
10111 19873
10112 284, 19983
10113 187
There are about 30,000 rows in this text file. I would like to create a new text file with no commas in the second column, like:
GENE_ID PROBE_ID
10111 19873
10112 284
10112 19983
10113 187
...but also, I want all of the PROBE_IDs to come from another text file, probes.txt, which looks like:
19873
284
187
...so that I can make a finalProduct.txt file that looks like:
GENE_ID PROBE_ID
10111 19873
10112 284
10113 187
If I wanted to type in each row of probes.txt by hand, I think I could achieve this result with something like:
awk -F"\t" '{for(i=1;i<=NF;i++){if ($i ~ /probeID#/){print $i}}}' myGenes > test.txt
But, of course, this wouldn't put the comma-separated probe IDs on different rows, and I would have to input each of the thousands of probeIDs by hand.
Does anyone have any hints or better suggestions? Thank you!
EDIT FOR CLARITY
I'd like to take originalFile.txt and eventually produce finalProduct.txt, using probes.txt. There are a few ways to think about it:
For each probe listed in probes.txt, find out if it exists in originalFile.txt; if it does, print a line that has just that probe and the corresponding GENE_ID.
Or you could think of it as a kind of join (or filter) on originalFile.txt using probes.txt, where the PROBE_ID column of the output file holds the probes from probes.txt and the GENE_ID column holds the corresponding GENE_ID from originalFile.txt.
Or you could think of it as two steps:
1. Make an intermediate file in which each row pairs a GENE_ID with exactly one PROBE_ID.
2. Remove every row of that intermediate file whose PROBE_ID does not appear in probes.txt.
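For what it's worth, step 1 of that last framing (the intermediate file) might be sketched roughly like this. The function name split_probes is made up for illustration, and the sketch assumes the columns are separated by tabs or spaces and the probes by a comma with optional spaces, as in the sample above:

```shell
# Hypothetical helper (name is illustrative): expand comma-separated
# PROBE_IDs so that each output row holds exactly one probe.
# Assumes tabs/spaces between columns and "," or ", " between probes.
split_probes() {
    awk 'BEGIN { FS = "[\t, ]+"; OFS = "\t" }
         NR == 1 { print $1, $2; next }          # keep the header row unchanged
         { for (i = 2; i <= NF; i++) if ($i != "") print $1, $i }' "$1"
}
```

It would be called as something like split_probes originalFile.txt > intermediate.txt, where intermediate.txt is also just a placeholder name.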
EDIT 2
Currently trying to repurpose this - no result yet, but maybe link will be helpful.
If probes.txt is small enough that it will fit in memory, you could try the following awk script:
BEGIN {
    OFS = "\t"
    # this is to handle the given input that has spaces after the comma
    # and tabs between gene and probes
    FS = "[\t, ]+"
    # load probes into an array, one probe ID per line of probes.txt
    while ((getline probe < "probes.txt") > 0) {
        probes[probe] = 1
    }
    close("probes.txt")
}
{
    # for each probe on this line, check if it's in the array
    # and skip it if not ("in" tests membership without adding new keys)
    for (i = 2; i <= NF; i++) {
        if ($i in probes) {
            print $1, $i
        }
    }
}
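To run the script above, you could save it to a file and call it with awk -f, for example awk -f filter.awk originalFile.txt > finalProduct.txt (filter.awk is just a placeholder name). Note that it will not print the GENE_ID PROBE_ID header, since PROBE_ID is not listed in probes.txt. If you want the header, a variant along the following lines might work. This is only a sketch: the wrapper name filter_probes and its argument order are invented for illustration, it reads probes.txt as the first input file instead of using getline, and it assumes probes.txt is not empty (the NR == FNR test misfires on an empty first file).

```shell
# Hypothetical wrapper (name and argument order are illustrative only).
# Usage: filter_probes probes.txt originalFile.txt > finalProduct.txt
filter_probes() {
    probes_file=$1
    genes_file=$2
    awk 'BEGIN { FS = "[\t, ]+"; OFS = "\t" }
         NR == FNR { probes[$1]; next }          # first file: remember each probe ID
         FNR == 1  { print $1, $2; next }        # second file: pass the header through
         { for (i = 2; i <= NF; i++) if ($i in probes) print $1, $i }' \
        "$probes_file" "$genes_file"
}
```

Both versions keep only the probes in memory, so the 30,000-row file is streamed one line at a time.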