I am trying to do the following with a sed script, but it is taking too much time. It looks like I'm doing something wrong.
Scenario:
I have student records (more than 1 million) in students.txt. In each line of this file, the first 10 characters are the student ID, the next 10 characters are the contact number, and so on.
students.txt
10000000019234567890XXX...
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ...
I have another file (encrypted_contact_numbers.txt) which has all the phone numbers and the corresponding encrypted phone numbers, as below:
encrypted_contact_numbers.txt
Phone_Number, Encrypted_Phone_Number
9234567890, 1122334455
9325788532, 4466742178
.
.
.
8766443367, 2964267747
I want to replace all the contact numbers (11th–20th position) in students.txt with the corresponding encrypted phone numbers from encrypted_contact_numbers.txt.
Expected Output:
10000000011122334455XXX...
10000000024466742178YYY...
.
.
.
10010000002964267747ZZZZ...
I am using the sed scripts below to do this operation. They work fine, but too slowly.
Approach 1:
while read -r pattern replacement; do
    sed -i "s/$pattern/$replacement/" students.txt
done < encrypted_contact_numbers.txt
Approach 2:
sed 's| *\([^ ]*\) *\([^ ]*\).*|s/\1/\2/g|' <encrypted_contact_numbers.txt |
sed -f- students.txt > outfile.txt
Is there any way to process this huge file quickly?
Update: 9-Feb-2018
The solutions given in AWK and Perl work fine if the phone number is in the specified position (columns 11-20). If I try to do a global replacement, it takes too much time to process. Is there a better way to achieve this?
students.txt (updated version):
10000000019234567890XXX...9234567890
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ9234567890...
AWK, like sed, is a programming language that deals with large bodies of text. But while people use sed to process and modify text, people mostly use AWK as a tool for analysis and reporting.
Generally, I would say grep is the fastest and sed the slowest, although this depends on what exactly you are doing. I find awk much faster than sed. You can speed up grep with the -F option if you only need simple fixed strings rather than real regular expressions.
By default, when sed reads a line into the pattern space, it discards the terminating newline (\n) character. Nevertheless, multi-line strings can be handled by appending further input lines to the pattern space, for example with the N command.
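As a small illustration of that point, here is a minimal sketch (the input words are made up for the demo) that uses N to append the next input line to the pattern space, so that s/// can see and replace the embedded newline. An even number of input lines is used on purpose, because sed implementations differ in what they do when N finds no next line.

```shell
# Join each pair of input lines into one line.
# N appends the next input line to the pattern space, separated by \n,
# which the following s/// command can then match and replace.
joined=$(printf 'one\ntwo\nthree\nfour\n' | sed 'N;s/\n/ /')
printf '%s\n' "$joined"
```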
You can tell sed to carry out multiple operations by repeating -e (or by using -f if your script is in a file). sed -i -e 's/a/b/g' -e 's/b/d/g' file makes both changes to the single file named file, in place.
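One detail worth keeping in mind is that the expressions run in the order given, each one on the result of the previous one. A minimal sketch, using a made-up input string and writing to standard output instead of editing a file in place:

```shell
# Both expressions are applied to every line, in order:
# first a -> b, then every b (including the new ones) -> d.
result=$(printf 'a b c\n' | sed -e 's/a/b/g' -e 's/b/d/g')
printf '%s\n' "$result"   # prints: d d c
```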
awk to the rescue! If you have enough memory to keep the phone_map file (encrypted_contact_numbers.txt in the question) in memory:
awk -F', *' 'NR==FNR {a[$1]=$2; next}    # first file: phone -> encrypted phone
{key=substr($0,11,10)}                    # contact number: 10 characters starting at column 11
key in a {$0=substr($0,1,10) a[key] substr($0,21)}1' phone_map data_file
Not tested, since you're missing the data file. It should be faster, since each file is scanned only once.
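For the updated case in the question, where a contact number can also appear later in the line, the same idea (load the map once, scan the data once) can be extended. The following is only a sketch under assumptions that are not in the original: the first 10 characters are always the student ID and are left alone, any 10-character substring after that which appears in the map is replaced (leftmost match first), and the temporary directory and the shortened sample lines exist only for the demonstration.

```shell
#!/bin/sh
# Demo data based on the samples in the question; the temp directory is only for this sketch.
tmp=$(mktemp -d)
cat > "$tmp/encrypted_contact_numbers.txt" <<'EOF'
Phone_Number, Encrypted_Phone_Number
9234567890, 1122334455
9325788532, 4466742178
8766443367, 2964267747
EOF
cat > "$tmp/students.txt" <<'EOF'
10000000019234567890XXX...9234567890
10000000029325788532YYY...
10010000008766443367ZZZZ9234567890...
EOF

out=$(awk -F', *' '
NR == FNR { map[$1] = $2; next }       # first file: phone -> encrypted phone
{
    line = $0
    res = substr(line, 1, 10)          # assumed: student ID, never replaced
    i = 11
    n = length(line)
    while (i <= n) {
        key = substr(line, i, 10)      # 10-character window starting at i
        if (key in map) {              # known phone number: swap it, skip past it
            res = res map[key]
            i += 10
        } else {                       # otherwise copy one character and move on
            res = res substr(line, i, 1)
            i++
        }
    }
    print res
}' "$tmp/encrypted_contact_numbers.txt" "$tmp/students.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With the sample data this prints 10000000011122334455XXX...1122334455, 10000000024466742178YYY... and 10010000002964267747ZZZZ1122334455... on three lines. Each line costs one hash lookup per character instead of one sed pass per phone number, so it should scale much better than the loop over sed, but it has not been timed on the real file. Note that any 10 characters that happen to equal a known phone number are replaced, even inside a longer run of digits, so check whether that is acceptable for your data.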