Replacing values in large table using conversion table

Q: How to replace a value in a table in Excel?

Right-click on a column -> Select Replace Values 2. Go to the Transform tab -> click on Replace Values 3. You can right-click a value within a column and click on Replace Values After any of the 3 steps, the Replace Values pop-up screen appears. 1.2. Arguments Table.ReplaceValue Now let’s have a look at how replacing a value works.

Tags:

bash

unix

sed

large-data

awk

I am trying to replace values in a large space-delimited text-file and could not find a suitable answer for this specific problem:

Say I have a file "OLD_FILE", containing a header and approximately 2 million rows:

COL1 COL2 COL3 COL4 COL5
rs10 7 92221824 C A 
rs1000000 12 125456933 G A 
rs10000010 4 21227772 T C 
rs10000012 4 1347325 G C 
rs10000013 4 36901464 C A 
rs10000017 4 84997149 T C 
rs1000002 3 185118462 T C 
rs10000023 4 95952929 T G 
...

I want to replace the first value of each row with a corresponding value, using a large (2.8M rows) conversion table. In this conversion table, the first column lists the value I want to have replaced, and the second column lists the corresponding new values:

COL1_b36       COL2_b37
rs10    7_92383888
rs1000000       12_126890980
rs10000010      4_21618674
rs10000012      4_1357325
rs10000013      4_37225069
rs10000017      4_84778125
rs1000002       3_183635768
rs10000023      4_95733906
...

The desired output would be a file where all values in the first column have been changed according to the conversion table:

COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A 
12_126890980 12 125456933 G A 
4_21618674 4 21227772 T C 
4_1357325 4 1347325 G C 
4_37225069 4 36901464 C A 
4_84778125 4 84997149 T C 
3_183635768 3 185118462 T C 
4_95733906 4 95952929 T G 
...

Additional info:

Performance is an issue (the following command takes approximately a year:

while read a b; do sed -i "s/\b$a\b/$b/g" OLD_FILE ; done < CONVERSION_TABLE
A complete match is necessary before replacing
Not every value in the OLD_FILE can be found in the conversion table...
...but every value that could be replaced, can be found in the conversion table.

Any help is very much appreciated.

290

asked Jan 09 '13 12:01

KJ_

2 Answers

Here's one way using awk:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE

Results:

COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G

Explanation, in order of appearance:

NR==1 { next }            # simply skip processing the first line (header) of
                          # the first file in the arguments list (TABLE)

FNR==NR { ... }           # This is a construct that only returns true for the
                          # first file in the arguments list (TABLE)

a[$1]=$2                  # So when we loop through the TABLE file, we add the
                          # column one to an associative array, and we assign
                          # this key the value of column two

next                      # This simply skips processing the remainder of the
                          # code by forcing awk to read the next line of input

$1 in a { ... }           # Now when awk has finished processing the TABLE file,
                          # it will begin reading the second file in the
                          # arguments list which is OLD_FILE. So this construct
                          # is a condition that returns true literally if column
                          # one exists in the array

$1=a[$1]                  # re-assign column one's value to be the value held
                          # in the array

1                         # The 1 on the end simply enables default printing. It
                          # would be like saying: $1 in a { $1=a[$1]; print $0 }'

160

answered Sep 20 '22 13:09

Steve

This might work for you (GNU sed):

sed -r '1d;s|(\S+)\s*(\S+).*|/^\1\\>/s//\2/;t|' table | sed -f - file

answered Sep 20 '22 13:09

potong

Related questions
                            
                                Having getopts to show help if no options provided
                            
                                use external python script to open maya and run another script inside maya
                            
                                Bash retrieve column number from column name
                            
                                Bash: Finding Powerset of array
                            
                                Running a bash script from alpine based docker
                            
                                Get all but first N arguments to a bash function
                            
                                if-statement always evaluates to TRUE
                            
                                How to iterate over positional parameters in a Bash script?
                            
                                How to make a bash script that creates 40 simultaneous instances of a program?
                            
                                bash: function + source + declare = boom
                            
                                What does it mean "bash < <( curl http://rvm.io/releases/rvm-install-head )" [closed]
                            
                                Creating empty file in all subfolders
                            
                                Bash c-style if statement and styling techniques
                            
                                bash script with ls $variable
                            
                                Best way to generate a random string following a specific pattern in Bash
                            
                                Subtracting strings (that are numbers) in a shell script
                            
                                Linux - get text from second tab
                            
                                regular expression extract string after a colon in bash
                            
                                How do I get a user's friendly username on UNIX?
                            
                                Remove all backup files in Linux using Shell script recursively [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With