Replace specific text with a redacted version using Python

Tags: python-3.x, nlp, lstm

I am looking to do the opposite of what has been done in "Partial replacement with re.sub()":

import re

text = '1234-5678-9101-1213 1415-1617-1819-hello'

# Mask the first three groups only when a fourth group of four digits follows
output = re.sub(r"(\d{4}-){3}(?=\d{4})", "XXXX-XXXX-XXXX-", text)

# output == 'XXXX-XXXX-XXXX-1213 1415-1617-1819-hello'

My overall goal is to replace all XXXX within a text using a neural network. XXXX can represent names, places, numbers, dates, etc. that are in a .csv file.

The end result would turn masked text such as:

XXXX went to XXXX XXXXXX

into:

Sponge Bob went to Disney World.

In short, I am unmasking text and replacing it with a generated dataset using fuzzy.

asked Dec 17 '19 21:12

Green

1 Answer

You can do it using named-entity recognition (NER). It's fairly simple, and there are off-the-shelf tools that do it, such as spaCy.

NER is an NLP task where a neural network (or other method) is trained to detect certain entities, such as names, places, dates and organizations.

Example:

Sponge Bob went to South beach, he payed a ticket of $200!
I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.

Returns:

[Image: NER with spaCy]

Just be aware that this is not 100% accurate!

Here is a little snippet for you to try out:

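import spacy

phrases = ['Sponge Bob went to South beach, he payed a ticket of $200!', 'I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.']
# Needs an installed model, e.g.: python -m spacy download en_core_web_sm
# (the 'en' shortcut no longer works in spaCy 3 and later)
nlp = spacy.load('en_core_web_sm')
for phrase in phrases:
   doc = nlp(phrase)
   replaced = ""
   for token in doc:
      # token.ent_type_ is a non-empty string when the token is part of an entity
      if token.ent_type_:
         replaced += "XXXX "
      else:
         replaced += token.text + " "
   print(replaced)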

Read more here: https://spacy.io/usage/linguistic-features#named-entities

You could, instead of replacing with XXXX, replace based on the entity type, like:

if token.ent_type_ == "PERSON":
   replaced += "<PERSON> "
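
An entity can span several tokens (for example "Sponge Bob"), so it can be convenient to replace whole entity spans rather than single tokens. Below is a minimal sketch of that idea, assuming the entities are available as (start_char, end_char, label) tuples; with spaCy these would come from ent.start_char, ent.end_char and ent.label_ for each ent in doc.ents. The name mask_spans and the hard-coded spans are illustrative only, not real model output.

```python
def mask_spans(text, spans):
    """Replace each (start, end, label) character span with <LABEL>."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(f"<{label}>")        # one placeholder per entity
        cursor = end
    pieces.append(text[cursor:])           # text after the last entity
    return "".join(pieces)


text = "Sponge Bob went to South beach"
# Hand-written spans for illustration; a real NER model may label these differently.
spans = [(0, 10, "PERSON"), (19, 30, "LOC")]
masked = mask_spans(text, spans)
print(masked)
```

Working on character offsets also keeps the original spacing and punctuation, which the token-by-token concatenation does not.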

Then:

import re, random

personames = ["Jack", "Mike", "Bob", "Dylan"]

# re has no replace(); use re.sub(). The callable picks a new name for each match.
phrase = re.sub("<PERSON>", lambda m: random.choice(personames), phrase)
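
To extend this to several entity types, one option is a dictionary of replacement pools keyed by label. The following is a rough sketch under assumptions: the fill_placeholders name, the pools dictionary and the CSV layout (columns label and value) are hypothetical, and the in-memory string stands in for the .csv file mentioned in the question.

```python
import csv
import io
import random
import re

# Stand-in for the .csv file; the column names are an assumption.
csv_data = io.StringIO("label,value\nPERSON,Jack\nPERSON,Mike\nGPE,Lisbon\n")

# Group the replacement values by entity label.
pools = {}
for row in csv.DictReader(csv_data):
    pools.setdefault(row["label"], []).append(row["value"])


def fill_placeholders(text, pools, rng=random):
    """Replace each <LABEL> with a random value from pools[LABEL]."""
    def pick(match):
        label = match.group(1)
        # Leave the placeholder untouched when no replacement is known.
        if label not in pools:
            return match.group(0)
        return rng.choice(pools[label])
    return re.sub(r"<([A-Z_]+)>", pick, text)


filled = fill_placeholders("<PERSON> went to <GPE> on <DATE>", pools)
print(filled)
```

Placeholders without a pool (here <DATE>) are left as they are, so missing data stays visible instead of being silently dropped.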

answered Oct 14 '22 11:10

Tiago Duque
