SED or AWK script to replace multiple text

I am trying to do the following with a sed script, but it is taking too much time. It looks like I'm doing something wrong.

Scenario: I have student records (more than 1 million) in students.txt. In each line of this file, the first 10 characters are the student ID, the next 10 characters are the contact number, and so on.

students.txt

10000000019234567890XXX...
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ...

I have another file (encrypted_contact_numbers.txt) which has all the phone numbers and the corresponding encrypted phone numbers, as below:

encrypted_contact_numbers.txt

Phone_Number, Encrypted_Phone_Number

9234567890, 1122334455
9325788532, 4466742178
.
.
.
8766443367, 2964267747

I want to replace every contact number (11th–20th position) in students.txt with the corresponding encrypted phone number from encrypted_contact_numbers.txt.

Expected Output:

10000000011122334455XXX...
10000000024466742178YYY...
.
.
.
10010000002964267747ZZZZ...

I am using the sed scripts below to do this operation. They work fine, but too slowly.

Approach 1:

while read -r pattern replacement; do   
    sed -i "s/$pattern/$replacement/" students.txt
done < encrypted_contact_numbers.txt

Approach 2:

sed 's| *\([^ ]*\) *\([^ ]*\).*|s/\1/\2/g|' <encrypted_contact_numbers.txt |
sed -f- students.txt > outfile.txt

Is there any way to process this huge file quickly?

Update: 9-Feb-2018

The solutions given in AWK and Perl work fine if the phone number is in the specified position (columns 11-20). If I try to do a global replacement, it takes too much time to process. Is there a better way to achieve this?

students.txt : Updated version

10000000019234567890XXX...9234567890
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ9234567890...

Dhanabalan asked Jan 23 '18 18:01


1 Answer

awk to the rescue!

If you have enough memory to hold the whole phone_map file:

awk -F', *' 'NR==FNR{a[$1]=$2; next}               # first file: phone -> encrypted phone
                    {key=substr($0,11,10)}         # characters 11-20 hold the contact number
           key in a {$0=substr($0,1,10) a[key] substr($0,21)}1' phone_map data_file

Not tested, since the data file is missing. It should be faster, since both files are scanned only once.
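
For the updated question, where the number can also show up outside columns 11-20, one option is to keep the same one-pass lookup but slide a 10-character window across each line. This is only a sketch under assumptions: phone numbers are always exactly 10 characters, the sample files are made up from the rows shown in the question (with the "..." left out), and a window that straddles two fields could in principle match a map key by accident. The names map, line and out are just illustrative.

```shell
# Hypothetical sample data, modelled on the rows in the question.
printf '%s\n' \
  '10000000019234567890XXX9234567890' \
  '10000000029325788532YYY' \
  '10010000008766443367ZZZZ9234567890' > students.txt

printf '%s\n' \
  'Phone_Number, Encrypted_Phone_Number' \
  '9234567890, 1122334455' \
  '9325788532, 4466742178' \
  '8766443367, 2964267747' > encrypted_contact_numbers.txt

awk -F', *' '
  # First file: load phone -> encrypted phone into memory.
  NR == FNR { map[$1] = $2; next }

  # Second file: test the 10-character window at each position.
  {
    line = $0; out = ""; i = 1; n = length(line)
    while (i <= n) {
      key = substr(line, i, 10)
      if (key in map) {            # known number: emit replacement, skip past it
        out = out map[key]
        i += 10
      } else {                     # otherwise copy one character and move on
        out = out substr(line, i, 1)
        i++
      }
    }
    print out
  }
' encrypted_contact_numbers.txt students.txt > outfile.txt
```

Each line is still read once and each window costs one hash lookup, so the work grows with the total number of characters rather than with lines times map entries as in the sed loops. Building the output one character at a time may be slow in some awk implementations, so treat this as a starting point to measure, not a tuned solution.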

karakfa answered Oct 07 '22 11:10