
Search replace string in a file based on column in other file

Tags: bash, shell, sed, awk

If we have the first file like below:

(a.txt)
 1  asm
 2  assert
 3  bio
 4  Bootasm
 5  bootmain
 6  buf
 7  cat
 8  console
 9  defs
10  echo

and the second like:

(b.txt)
 bio cat BIO bootasm
 bio defs cat
 Bio console 
 bio BiO
 bIo assert
 bootasm asm
 bootasm echo
 bootasm console
 bootmain buf
 bootmain bio
 bootmain bootmain
 bootmain defs
 cat cat
 cat assert
 cat assert

and we want the output to look like this:

 3 7 3 4
 3 9 7
 3 8
 3 3
 3 2
 4 1
 4 10
 4 8
 5 6
 5 3
 5 5
 5 9
 7 7
 7 2
 7 2

For each word in the second column of the first file, we check whether it appears in any column of any line of the second file; if it does, we replace it with the number from the first column of the first file. I managed to do this for the first column only; I couldn't do it for the rest.

Here is the command I use:

awk 'NR==FNR{a[$2]=$1;next}{$1=a[$1];}1' a.txt b.txt

which gives:

3 cat bio bootasm
3 defs cat
3 console
3 bio
3 assert
4 asm
4 echo
4 console
5 buf
5 bio
5 bootmain
5 defs
7 cat
7 assert
7 assert

How should I handle the other columns?

Thank you

user3057111 asked Dec 02 '13 11:12


3 Answers

awk 'NR==FNR{h[$2]=$1;next} {for (i=1; i<=NF;i++) $i=h[$i];}1' a.txt b.txt

NR is the global record number (by default, the line number) across all files. FNR is the record number within the current file. The NR==FNR block specifies what to do when the global line number equals the current file's line number, which is only true while reading the first file, i.e., a.txt. The next statement in this block skips the rest of the code, so the for loop runs only for the second file, i.e., b.txt.
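
To see the two counters side by side, here is a small sketch that prints NR and FNR for every line of two throwaway files. The file names and the helper show_counters are made up for illustration and are not part of the answer.

```shell
# Create two tiny sample files (hypothetical names) in a temp directory.
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/first.txt"
printf 'p\nq\nr\n' > "$tmp/second.txt"

# Print each line with NR, FNR, and whether NR==FNR holds for it.
show_counters() {
  awk '{ print $0, NR, FNR, (NR==FNR ? "first" : "later") }' "$@"
}

show_counters "$tmp/first.txt" "$tmp/second.txt"
# x 1 1 first
# y 2 2 first
# p 3 1 later
# q 4 2 later
# r 5 3 later
```

One caveat with this idiom: if the first file is empty, NR==FNR is also true while reading the second file, so the second file would be treated as the lookup table.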

First, we process the first file to store the word ids in an associative array: NR==FNR{h[$2]=$1;next}. After that, we can use these ids to map the words in the second file. The for loop (for (i=1; i<=NF;i++) $i=h[$i];) iterates over all columns and replaces the word in the ith column with its id via $i=h[$i]. Finally, the 1 at the end of the script causes every line to be printed.
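
A side effect of assigning to a field is that awk rebuilds the whole record, joining the fields with the output field separator (a single space by default). This is why the leading blanks of b.txt do not show up in the result. A minimal illustration with a made-up input line (the helper name normalize_spacing is hypothetical):

```shell
# Assigning to any field, even to itself, makes awk rebuild $0 using OFS.
normalize_spacing() {
  awk '{ $1 = $1 } 1'
}

printf ' bio    cat  \n' | normalize_spacing
# bio cat
```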

Produces:

3 7 3 4
3 9 7
3 8
3 3
3 2
4 1
4 10
4 8
5 6
5 3
5 5
5 9
7 7
7 2
7 2

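The lookup is case-sensitive, so with the sample files from the question words such as BIO or bootasm (listed as Bootasm in a.txt) would map to an empty string. To make the script case-insensitive, add tolower calls to the array indices: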

awk 'NR==FNR{h[tolower($2)]=$1;next} {for (i=1; i<=NF;i++) $i=h[tolower($i)];}1' a.txt b.txt
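
Both versions replace a word that is not listed in a.txt with an empty string, because looking up a missing key in an awk array yields an empty value. If unknown words should be left untouched instead (the question does not say which is wanted), the lookup can be guarded with the in operator. A sketch under that assumption; the function name map_words and the sample files are made up:

```shell
# map_words IDS WORDS: replace each word in WORDS by its id from IDS
# (case-insensitively); words without an id are left unchanged.
map_words() {
  awk 'NR==FNR { h[tolower($2)] = $1; next }
       {
         for (i = 1; i <= NF; i++) {
           key = tolower($i)
           if (key in h) $i = h[key]   # only touch words that have an id
         }
       }
       1' "$1" "$2"
}

# Small made-up sample, written to a temporary directory.
tmp=$(mktemp -d)
printf ' 3  bio\n 4  Bootasm\n 7  cat\n' > "$tmp/ids.txt"
printf ' bio cat BIO bootasm\n cat unknown\n' > "$tmp/words.txt"

map_words "$tmp/ids.txt" "$tmp/words.txt"
# 3 7 3 4
# 7 unknown
```

Using key in h rather than testing h[key] also avoids creating empty array entries for every unknown word.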

perreal answered Oct 14 '22 16:10


Divide and conquer! A bit archaic, but it does the job =)

# one pass per column: print the id of the word in column N (empty if there is none)
awk 'NR==FNR{a[$2]=$1;next}{print a[$1]}' a.txt b.txt > 1
awk 'NR==FNR{a[$2]=$1;next}{print a[$2]}' a.txt b.txt > 2
awk 'NR==FNR{a[$2]=$1;next}{print a[$3]}' a.txt b.txt > 3
awk 'NR==FNR{a[$2]=$1;next}{print a[$4]}' a.txt b.txt > 4
paste 1 2 3 4 | tr '\t' ' '

gives:

3 7 3 4
3 9 7 
3 8  
3 3  
3 2  
4 1  
4 10  
4 8  
5 6  
5 3  
5 5  
5 9  
7 7  
7 2  
7 2  

Here I just changed the column number in each command and pasted the results together, with a bit of editing in between.
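
The same column-by-column idea can be written as a loop so the number of columns is not hard-coded. This is only a sketch: the function name map_by_column, its arguments, and the sample files are made up, and matching stays case-sensitive as in the commands above.

```shell
# map_by_column IDS WORDS NCOLS: one awk pass per column, then paste them.
map_by_column() {
  ids=$1; words=$2; ncols=$3
  work=$(mktemp -d)
  set --                              # reuse positional parameters as a file list
  n=1
  while [ "$n" -le "$ncols" ]; do
    # Print the id of the word in column n (an empty line if there is none).
    awk -v n="$n" 'NR==FNR { a[$2] = $1; next } { print a[$n] }' \
      "$ids" "$words" > "$work/col$n"
    set -- "$@" "$work/col$n"
    n=$((n + 1))
  done
  paste -d ' ' "$@"                   # join the per-column files with spaces
  rm -rf "$work"
}

# Small made-up sample, written to a temporary directory.
tmp=$(mktemp -d)
printf ' 3  bio\n 7  cat\n 9  defs\n' > "$tmp/ids.txt"
printf ' bio cat\n bio defs cat\n' > "$tmp/words.txt"

map_by_column "$tmp/ids.txt" "$tmp/words.txt" 3
```

Lines with fewer words than NCOLS end with trailing spaces, just like the output shown above.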

Gery answered Oct 14 '22 14:10


{
 cat a.txt;  echo "--EndA--";cat b.txt
} | sed -n '1 h
1 !H
$ {
  x
: loop
  s/^ *\([[:digit:]]\{1,\}\) *\([^[:cntrl:]]*\)\(\n\)\(.*\)\2/\1 \2\3\4\1/
  t loop
  s/^ *[[:digit:]]\{1,\} *[^[:cntrl:]]*\n//
  t loop
  s/^[[:space:]]*--EndA--\n//
  p
  }
 '

The "--EndA--" marker can be replaced with any other string if there is a chance that it appears in one of the files (mainly a.txt).

NeronLeVelu answered Oct 14 '22 14:10