
SED or AWK replace all with patterns from another file

I am trying to do pattern replacement with a sed script, but it is not working properly.

sample_content.txt

288Y2RZDBPX1000000001dhana
JP2F64EI1000000002d
EU9V3IXI1000000003dfg1000000001dfdfds
XATSSSSFOO4dhanaUXIBB7TF71000000004adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN1000000005egw

patterns.txt

1000000001 9000000003
1000000002 2000000001
1000000003 3000000001
1000000004 4000000001
1000000005 5000000001

Expected output

288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw

I can do a single replacement with sed, like this:

sed 's/1000000001/9000000003/g' sample_content.txt

Note:

  • The matching pattern is not in a fixed position.
  • A single line in sample_content.txt may contain multiple values to replace.
  • sample_content.txt and patterns.txt each have more than 1 million records.
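
A minimal sketch of one way to meet the three notes above, assuming every key in patterns.txt is exactly 10 characters long, as in the sample data. It loads the pairs into an awk array and slides a 10-character window along each line, so each line costs one hash lookup per character rather than one regex pass per pattern. The names map, w and new_content.txt are illustrative, not part of the original files. Unlike a chain of sed commands, replaced text is never re-scanned by later rules.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

cat > patterns.txt <<'EOF'
1000000001 9000000003
1000000002 2000000001
1000000003 3000000001
1000000004 4000000001
1000000005 5000000001
EOF

cat > sample_content.txt <<'EOF'
288Y2RZDBPX1000000001dhana
JP2F64EI1000000002d
EU9V3IXI1000000003dfg1000000001dfdfds
XATSSSSFOO4dhanaUXIBB7TF71000000004adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN1000000005egw
EOF

# w is the assumed fixed key width (10 characters in the sample data).
awk -v w=10 '
NR == FNR { map[$1] = $2; next }          # first file: old value -> new value
{
    out = ""; i = 1; n = length($0)
    while (i <= n) {
        key = substr($0, i, w)            # candidate key starting at position i
        if (key in map) { out = out map[key]; i += w }            # replace, skip the key
        else            { out = out substr($0, i, 1); i++ }       # copy one character
    }
    print out
}' patterns.txt sample_content.txt > new_content.txt

cat new_content.txt
```

On the five sample lines this should reproduce the expected output listed in the question, including the line where the key is preceded by another digit (TF71000000004).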

File attachment link: https://drive.google.com/open?id=1dVzivKMirEQU3yk9KfPM6iE7tTzVRdt_

Could anyone suggest how to achieve this without hurting performance?

Updated on 11-Feb-2018

After analyzing the real file, I found that there is a grade value at positions 30 and 31, which determines where the replacements need to be applied:
If the grade is AB, replace the 10-digit phone numbers at positions 41-50 and 101-110
If the grade is BC, replace the 10-digit phone numbers at positions 11-20, 61-70 and 151-160
If the grade is DE, replace the 10-digit phone numbers at positions 1-10, 71-80, 151-160 and 181-190

In this way I am seeing 50 unique grades across 2 million sample records.

{
    grade = substr($0, 30, 2)    # identify the grade (positions 30-31, as described above)
    if (grade == "AB") {
        print substr($0,41,10) ORS substr($0,101,10)
    } else if (grade == "BC") {
        print substr($0,11,10) ORS substr($0,61,10) ORS substr($0,151,10)
    }
    # ... and likewise for the remaining grades (about 50 conditions)
}

Is this approach advisable, or is there a better one?
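
One possible alternative to 50 if/else branches is a lookup table from grade to start positions, sketched below. It assumes the grade sits at positions 30-31 and that each phone number is 10 characters wide, as described above. The array names pos and map, the file records.txt and the dot-padded test records are made up for illustration; only the AB and BC rows are filled in.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

cat > patterns.txt <<'EOF'
1000000001 9000000003
1000000002 2000000001
1000000003 3000000001
1000000004 4000000001
1000000005 5000000001
EOF

# Two synthetic fixed-width records, padded with dots (illustrative data only):
#   record 1: grade AB, numbers at 41-50 and 101-110
#   record 2: grade BC, numbers at 11-20, 61-70 and 151-160
{
    printf '%-29s%s%-9s%s%-50s%s\n' '' AB '' 1000000001 '' 1000000002
    printf '%-10s%s%-9s%s%-29s%s%-80s%s\n' '' 1000000003 '' BC '' 1000000004 '' 1000000005
} | tr ' ' '.' > records.txt

awk '
BEGIN {
    # grade -> space-separated start positions of the 10-character fields
    pos["AB"] = "41 101"
    pos["BC"] = "11 61 151"
    pos["DE"] = "1 71 151 181"
}
NR == FNR { map[$1] = $2; next }          # first file: old number -> new number
{
    grade = substr($0, 30, 2)
    n = split(pos[grade], p, " ")         # n is 0 for a grade with no table entry
    for (k = 1; k <= n; k++) {
        old = substr($0, p[k], 10)
        if (old in map)                   # splice the new number in place
            $0 = substr($0, 1, p[k] - 1) map[old] substr($0, p[k] + 10)
    }
    print
}' patterns.txt records.txt > new_records.txt

cat new_records.txt
```

Adding a grade then means adding one line to the BEGIN block, and each record costs a handful of hash lookups regardless of how many patterns are loaded.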

Asked by Dhanabalan on Feb 10 '18 at 09:02


1 Answer

Benchmarks for future reference

Test environment:

The tests use your sample files: patterns.txt with 50,000 lines and contents.txt, also with 50,000 lines.

Every solution loads all lines of patterns.txt, but only the first 1000 lines of contents.txt are processed.

The test laptop has a dual-core 64-bit Intel(R) Celeron(R) CPU N3050 @ 2.16GHz and 4 GB RAM, and runs Debian 9 Testing (64-bit) with GNU sed 4.4 and GNU awk 4.1.4.

In all cases the output is redirected to a new file, to avoid the overhead of printing to the screen.

Results:

1. RavinderSingh13 1st awk solution

$ time awk 'FNR==NR{a[$1]=$2;next}   {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt  <(head -n 1000 contents.txt) >newcontents.txt

real    19m54.408s
user    19m44.097s
sys     0m1.981s

2. EdMorton 1st awk Solution

$ time awk 'NR==FNR{map[$1]=$2;next}{for (old in map) {gsub(old,map[old])}print}' patterns.txt <(head -n1000 contents.txt) >newcontents.txt

real    20m3.420s
user    19m16.559s
sys     0m2.325s

3. Sed (my sed) solution

$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -n 1000 contents.txt) >newcontents.txt

real    1m1.070s
user    0m59.562s
sys     0m1.443s

4. Cyrus sed solution

$ time sed -f <(sed -E 's|(.*) (.*)|s/\1/\2/|g' patterns.txt) <(head -n1000 contents.txt) >newcontents.txt

real    1m0.506s
user    0m59.871s
sys     0m1.209s
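
To see what the two sed solutions above actually feed to sed -f, the generated scripts can be written to files and inspected. This is a small sketch using two of the sample pairs; the file names rules_printf.sed and rules_sed.sed are made up, and $(cat patterns.txt) stands in for the bash shorthand $(<patterns.txt) so that it also runs in a plain POSIX shell.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

cat > patterns.txt <<'EOF'
1000000001 9000000003
1000000002 2000000001
EOF

# Solution 3 style: word splitting hands printf two arguments per rule,
# and printf reuses the format string until the arguments run out.
printf 's/%s/%s/g\n' $(cat patterns.txt) > rules_printf.sed

# Solution 4 style: an outer sed rewrites each "old new" line into an s command.
# The trailing g belongs to the outer command, not to the generated rule.
sed -E 's|(.*) (.*)|s/\1/\2/|g' patterns.txt > rules_sed.sed

cat rules_printf.sed
# s/1000000001/9000000003/g
# s/1000000002/2000000001/g

cat rules_sed.sed
# s/1000000001/9000000003/
# s/1000000002/2000000001/

# A line with the same value twice shows the difference between the two rule sets.
echo 'a1000000001b1000000001' | sed -f rules_printf.sed   # a9000000003b9000000003
echo 'a1000000001b1000000001' | sed -f rules_sed.sed      # a9000000003b1000000001
```

Because the rules generated in the solution 4 style carry no g flag, each one replaces only the first occurrence on a line, which matters for the "multiple matching values per line" requirement in the question.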

5. RavinderSingh13 2nd awk solution

$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt  <(head -n 1000 contents.txt) >newcontents.txt

real    0m25.572s
user    0m25.204s
sys     0m0.040s

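For a small input such as 1000 lines, RavinderSingh13's 2nd awk solution is the fastest (about 25 seconds, against about 1 minute for the sed solutions). Note that it moves on to the next line after the first substitution, so it does less work per line than the others. Let's run another test, this time with 9000 lines, to compare performance.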

6. RavinderSingh13 2nd awk solution with 9000 lines

$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt  <(head -9000 contents.txt) >newcontents.txt

real    22m25.222s
user    22m19.567s
sys      0m2.091s

7. Sed Solution with 9000 lines

$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -9000 contents.txt) >newcontents.txt

real    9m7.443s
user    9m0.552s
sys     0m2.650s

8. Parallel sed solution with 9000 lines

$ cat sedpar.sh
#!/bin/bash
# Needs bash: uses $SECONDS, $(<file) and process substitution.
s=$SECONDS
# Run three sed processes in the background, one per 3000-line slice.
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -n 3000 contents.txt) >newcontents1.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail -n +3001 contents.txt | head -n 3000) >newcontents2.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail -n +6001 contents.txt | head -n 3000) >newcontents3.txt &
wait    # block until all three background jobs have finished
cat newcontents1.txt newcontents2.txt newcontents3.txt >newcontents.txt && rm -f newcontents1.txt newcontents2.txt newcontents3.txt
echo "seconds elapsed: $((SECONDS - s))"

$ time ./sedpar.sh
seconds elapsed: 309

real    5m16.594s
user    9m43.331s
sys     0m4.232s

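Splitting the task across several commands, such as three parallel sed processes, seems to speed things up: about 5m17s of wall-clock time against 9m7s for a single sed on the same 9000 lines.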
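
The same idea can be sketched for an arbitrary number of jobs, assuming GNU split is available (the -n l/N option splits into N chunks without breaking lines). The variable jobs, the chunk. prefix and rules.sed are illustrative names, and the five sample lines stand in for the real contents.txt.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

cat > patterns.txt <<'EOF'
1000000001 9000000003
1000000002 2000000001
1000000003 3000000001
1000000004 4000000001
1000000005 5000000001
EOF

cat > contents.txt <<'EOF'
288Y2RZDBPX1000000001dhana
JP2F64EI1000000002d
EU9V3IXI1000000003dfg1000000001dfdfds
XATSSSSFOO4dhanaUXIBB7TF71000000004adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN1000000005egw
EOF

jobs=3                                    # number of parallel sed processes (tunable)

# Build the sed script once and share it between all jobs.
printf 's/%s/%s/g\n' $(cat patterns.txt) > rules.sed

# GNU split: cut contents.txt into $jobs line-aligned chunks chunk.00, chunk.01, ...
split -n l/"$jobs" -d contents.txt chunk.

for f in chunk.[0-9][0-9]; do
    sed -f rules.sed "$f" > "$f.out" &    # one background sed per chunk
done
wait                                      # block until every background job is done

# The glob expands in sorted order, so the original line order is preserved.
cat chunk.[0-9][0-9].out > newcontents.txt
rm -f chunk.[0-9][0-9] chunk.[0-9][0-9].out rules.sed

cat newcontents.txt
```

Setting jobs to the number of CPU cores is a reasonable starting point; on the dual-core test laptop, going beyond that is unlikely to help much.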

For those who would like to repeat the benchmarks on their own PC, the files contents.txt and patterns.txt can be downloaded either from the OP's link or from my GitHub:

contents.txt

patterns.txt

Answered by George Vasiliou on Oct 26 '22 at 02:10