Replace specific text with a redacted version using Python

I am looking to do the opposite of what has been done here:

import re

text = '1234-5678-9101-1213 1415-1617-1819-hello'

re.sub(r"(\d{4}-){3}(?=\d{4})", "XXXX-XXXX-XXXX-", text)

output = 'XXXX-XXXX-XXXX-1213 1415-1617-1819-hello'

(Linked question: "Partial replacement with re.sub()")

My overall goal is to replace all XXXX within a text using a neural network. XXXX can represent names, places, numbers, dates, etc. that are in a .csv file.

The end result would look like:

XXXX went to XXXX XXXXXX

Sponge Bob went to Disney World.

In short, I am unmasking text and replacing it with a generated dataset using fuzzy.

Green asked Dec 17 '19 21:12



1 Answer

You can do this with named-entity recognition (NER). It's fairly simple, and there are off-the-shelf tools for it, such as spaCy.

NER is an NLP task where a neural network (or other method) is trained to detect certain entities, such as names, places, dates and organizations.

Example:

Sponge Bob went to South beach, he payed a ticket of $200!
I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.

Returns:

[Image: NER with spaCy]

Just be aware that this is not 100% accurate!

Here is a little snippet for you to try out:

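import spacy

phrases = ['Sponge Bob went to South beach, he payed a ticket of $200!', 'I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.']
# Requires the model: python -m spacy download en_core_web_sm
# (the 'en' shortcut no longer works in spaCy 3)
nlp = spacy.load('en_core_web_sm')
for phrase in phrases:
    doc = nlp(phrase)
    replaced = ""
    for token in doc:
        # ent_type_ is a non-empty string when the token is part of a named entity
        if token.ent_type_:
            replaced += "XXXX "
        else:
            replaced += token.text + " "
    print(replaced)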

Read more here: https://spacy.io/usage/linguistic-features#named-entities

You could, instead of replacing with XXXX, replace based on the entity type, like:

if token.ent_type_ == "PERSON":
    replaced += "<PERSON> "
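
A token-by-token loop emits one placeholder per token, so a two-word name such as "Sponge Bob" would become two placeholders. A minimal sketch of one alternative is to replace whole entity spans by character offset. The helper name `mask_spans` and the hand-written spans below are assumptions for illustration; with spaCy, the spans could be built from `doc.ents` using each entity's `start_char`, `end_char` and `label_`.

```python
def mask_spans(text, spans):
    """Replace each (start, end, label) character span with '<LABEL>'."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text


# Hand-written spans for illustration only; with spaCy they could be built as
# [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
sentence = "Sponge Bob went to South beach."
spans = [(0, 10, "PERSON"), (19, 30, "LOC")]

masked = mask_spans(sentence, spans)
print(masked)  # <PERSON> went to <LOC>.
```

Which labels a real model assigns (for example `LOC` versus `GPE`) depends on the model, so the labels here are only placeholders.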

Then:

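import re, random

personames = ["Jack", "Mike", "Bob", "Dylan"]

# Use re.sub (there is no re.replace); the callable picks a new random name for each match
phrase = re.sub("<PERSON>", lambda match: random.choice(personames), phrase)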
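
The same idea can be extended to several entity types at once. The sketch below is only one possible approach: the function name `fill_placeholders`, the `candidates` dictionary and its values are hypothetical, and in the question's setting the lists would presumably be loaded from the .csv file instead.

```python
import random
import re


def fill_placeholders(text, candidates, rng=random):
    """Replace each <LABEL> with a random value from candidates[LABEL]."""
    def pick(match):
        label = match.group(1)
        # Leave unknown labels untouched rather than guessing a value.
        if label not in candidates:
            return match.group(0)
        return rng.choice(candidates[label])

    return re.sub(r"<([A-Z_]+)>", pick, text)


# Hypothetical replacement values, keyed by entity label.
candidates = {
    "PERSON": ["Jack", "Mike", "Bob", "Dylan"],
    "GPE": ["Lisbon", "Toronto"],
}

# A seeded generator makes the random choices repeatable between runs.
filled = fill_placeholders(
    "<PERSON> met <PERSON> in <GPE> on <DATE>.", candidates, random.Random(0)
)
print(filled)
```

Because `pick` is called once per match, each `<PERSON>` gets its own independent choice, and `<DATE>` stays in place since no candidates were supplied for it.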
Tiago Duque answered Oct 14 '22 11:10