SED or AWK script to replace multiple text

Tags: python, unix, sed, awk, perl

I am trying to do the following with a sed script, but it takes too much time. It looks like I'm doing something wrong.

Scenario: I have student records (more than 1 million) in students.txt. In each line of this file, the first 10 characters are the student ID, the next 10 characters are the contact number, and so on.

students.txt

10000000019234567890XXX...
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ...

I have another file (encrypted_contact_numbers.txt) which has all the phone numbers and the corresponding encrypted phone numbers, as below:

encrypted_contact_numbers.txt

Phone_Number, Encrypted_Phone_Number

9234567890, 1122334455
9325788532, 4466742178
.
.
.
8766443367, 2964267747

I want to replace all the contact numbers (11th–20th position) in students.txt with the corresponding encrypted phone number from encrypted_contact_numbers.txt.

Expected Output:

10000000011122334455XXX...
10000000024466742178YYY...
.
.
.
10010000002964267747ZZZZ...

I am using the sed scripts below to do this. They work fine, but too slowly.

Approach 1:

while read -r pattern replacement; do   
    sed -i "s/$pattern/$replacement/" students.txt
done < encrypted_contact_numbers.txt

Approach 2:

sed 's| *\([^ ]*\) *\([^ ]*\).*|s/\1/\2/g|' <encrypted_contact_numbers.txt |
sed -f- students.txt > outfile.txt

Is there any way to process this huge file quickly?

Update: 9-Feb-2018

The solutions given in AWK and Perl work fine if the phone number is in the specified position (columns 11-20). If I try a global replacement, it takes too much time to process. Is there a better way to achieve this?

students.txt (updated version)

10000000019234567890XXX...9234567890
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ9234567890...

asked Jan 23 '18 18:01

Dhanabalan

1 Answer

awk to the rescue!

If you have enough memory to hold the phone_map file:

awk -F', *' 'NR==FNR{a[$1]=$2; next}          # first file: phone -> encrypted
                    {key=substr($0,11,10)}     # 10 characters starting at position 11
           key in a {$0=substr($0,1,10) a[key] substr($0,21)}1' phone_map data_file

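Not tested, since the question doesn't include the data file. Here phone_map stands for encrypted_contact_numbers.txt and data_file for students.txt. It should be much faster, because each file is scanned only once.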
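For anyone who wants to try the single-pass lookup on a small input, here is a minimal sketch. The sample_*.txt file names and the three records are hypothetical, built from the rows shown in the question with the "..." parts left off. It assumes the map file is comma-separated with a header line, as in the question.

```shell
#!/bin/sh
# Hypothetical sample data modelled on the rows in the question.
cat > sample_students.txt <<'EOF'
10000000019234567890XXX
10000000029325788532YYY
10010000008766443367ZZZZ
EOF

cat > sample_encrypted_contact_numbers.txt <<'EOF'
Phone_Number, Encrypted_Phone_Number
9234567890, 1122334455
9325788532, 4466742178
8766443367, 2964267747
EOF

# Pass 1 (map file): store phone -> encrypted phone in an array.
# Pass 2 (student file): take characters 11-20 as the key and, if the
# key is known, rebuild the record around the encrypted number.
# The trailing "1" prints every record, changed or not.
awk -F', *' '
    NR == FNR  { map[$1] = $2; next }
               { key = substr($0, 11, 10) }
    key in map { $0 = substr($0, 1, 10) map[key] substr($0, 21) }
    1
' sample_encrypted_contact_numbers.txt sample_students.txt > sample_outfile.txt

cat sample_outfile.txt
```

The header line of the map file ends up as one extra array entry, which does no harm because no record has "Phone_Number" in positions 11-20. This only covers the fixed-position case, not the global replacement from the update.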

answered Oct 07 '22 11:10

karakfa
