I am trying to do pattern replacement using a sed script, but it is not working properly.
sample_content.txt
288Y2RZDBPX1000000001dhana
JP2F64EI1000000002d
EU9V3IXI1000000003dfg1000000001dfdfds
XATSSSSFOO4dhanaUXIBB7TF71000000004adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN1000000005egw
patterns.txt
1000000001 9000000003
1000000002 2000000001
1000000003 3000000001
1000000004 4000000001
1000000005 5000000001
Expected output
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw
I am able to do a single sed replacement like
sed 's/1000000001/9000000003/g' sample_content.txt
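For a handful of pairs I could, in principle, chain expressions (using the mappings from patterns.txt above), for example:
sed -e 's/1000000001/9000000003/g' \
    -e 's/1000000002/2000000001/g' \
    -e 's/1000000003/3000000001/g' sample_content.txt
but maintaining thousands of such hand-written expressions is not practical.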
Note:
File attachment link: https://drive.google.com/open?id=1dVzivKMirEQU3yk9KfPM6iE7tTzVRdt_
Could anyone suggest how I can achieve this without affecting performance?
Updated on 11-Feb-2018
After analyzing the real file, I noticed that there is a grade value at positions 30 and 31, which indicates where the replacements need to be applied.
If the grade is AB, then replace the 10-digit phone number at positions 41-50 and 101-110
If the grade is BC, then replace the 10-digit phone number at positions 11-20, 61-70 and 151-160
If the grade is DE, then replace the 10-digit phone number at positions 1-10, 71-80, 151-160 and 181-190
In this way, I am seeing 50 unique grades across 2 million sample records.
{ grade = substr($0, 30, 2) }   # identify grade at positions 30-31
{
  if (grade == "AB") {
    print substr($0, 41, 10) ORS substr($0, 101, 10)
  } else if (grade == "BC") {
    print substr($0, 11, 10) ORS substr($0, 61, 10) ORS substr($0, 151, 10)
  }
  # ... likewise for all 50 grade conditions
}
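In full, the approach I am considering would look roughly like the sketch below (the column positions and grade names are only the examples listed above, the file names are the ones from my sample, and the real script would need about 50 branches):
awk '
NR == FNR { map[$1] = $2; next }                 # load old -> new pairs from patterns.txt
{
    grade = substr($0, 30, 2)                    # grade value at positions 30-31
    n = 0
    if      (grade == "AB") n = split("41 101", pos)
    else if (grade == "BC") n = split("11 61 151", pos)
    # ... likewise for the remaining grades
    for (i = 1; i <= n; i++) {
        old = substr($0, pos[i], 10)             # 10-digit number at the fixed column
        if (old in map)
            $0 = substr($0, 1, pos[i]-1) map[old] substr($0, pos[i]+10)
    }
    print
}' patterns.txt sample_content.txt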
May I know whether this approach is advisable, or is there any other better approach?
Benchmarks for future reference
Test environment:
Using your sample files patterns.txt with 50,000 lines and contents.txt also with 50,000 lines. All lines from patterns.txt are loaded in all solutions, but only the first 1000 lines of contents.txt are examined.
The testing laptop is equipped with a dual-core 64-bit Intel(R) Celeron(R) CPU N3050 @ 2.16GHz, 4 GB RAM, Debian 9 64-bit, GNU sed 4.4 and GNU awk 4.1.4.
In all cases the output is sent to a new file to avoid the slow overhead of printing data on the screen.
Results:
1. RavinderSingh13 1st awk solution
$ time awk 'FNR==NR{a[$1]=$2;next} {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt <(head -n 1000 contents.txt) >newcontents.txt
real 19m54.408s
user 19m44.097s
sys 0m1.981s
2. EdMorton 1st awk Solution
$ time awk 'NR==FNR{map[$1]=$2;next}{for (old in map) {gsub(old,map[old])}print}' patterns.txt <(head -n1000 contents.txt) >newcontents.txt
real 20m3.420s
user 19m16.559s
sys 0m2.325s
3. Sed (my sed) solution
$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -n 1000 contents.txt) >newcontents.txt
real 1m1.070s
user 0m59.562s
sys 0m1.443s
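For reference, the process substitution above just expands patterns.txt into an ordinary sed script with one s/old/new/g command per pair; with the small patterns.txt from the question it would contain:
s/1000000001/9000000003/g
s/1000000002/2000000001/g
s/1000000003/3000000001/g
s/1000000004/4000000001/g
s/1000000005/5000000001/g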
4. Cyrus sed solution
$ time sed -f <(sed -E 's|(.*) (.*)|s/\1/\2/|g' patterns.txt) <(head -n1000 contents.txt) >newcontents.txt
real 1m0.506s
user 0m59.871s
sys 0m1.209s
5. RavinderSingh13 2nd awk solution
$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt <(head -n 1000 contents.txt) >newcontents.txt
real 0m25.572s
user 0m25.204s
sys 0m0.040s
For a small amount of input data like 1000 lines, the awk solution seems good. Let's make another test, with 9000 lines this time, to compare performance.
6. RavinderSingh13 2nd awk solution with 9000 lines
$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt <(head -9000 contents.txt) >newcontents.txt
real 22m25.222s
user 22m19.567s
sys 0m2.091s
7. Sed Solution with 9000 lines
$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -9000 contents.txt) >newcontents.txt
real 9m7.443s
user 9m0.552s
sys 0m2.650s
8. Parallel Seds Solution with 9000 lines
$ cat sedpar.sh
s=$SECONDS
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -3000 contents.txt) >newcontents1.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail +3001 contents.txt |head -3000) >newcontents2.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail +6001 contents.txt |head -3000) >newcontents3.txt &
wait
cat newcontents1.txt newcontents2.txt newcontents3.txt >newcontents.txt && rm -f newcontents1.txt newcontents2.txt newcontents3.txt
echo "seconds elapsed: $(($SECONDS-$s))"
$ time ./sedpar.sh
seconds elapsed: 309
real 5m16.594s
user 9m43.331s
sys 0m4.232s
Splitting the task into multiple commands, like three parallel seds, seems to speed things up.
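As a sketch of the same idea in a more general form, GNU split can cut contents.txt into N line-based chunks instead of hard-coding the head/tail offsets (N=4 here is only an example value, not something I benchmarked):
N=4
split -n l/$N contents.txt chunk.
for f in chunk.??; do
    sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) "$f" >"$f.out" &
done
wait
cat chunk.??.out >newcontents.txt && rm -f chunk.?? chunk.??.out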
For those who would like to repeat the benchmarks on their own PC, you can download contents.txt and patterns.txt either from the OP's links or from my github.