I am using presidio/langchain_experimental to anonymize text in Polish, but it does not detect names (e.g., "Jan Kowalski"). Here is my code:
from langchain_experimental.data_anonymizer import PresidioAnonymizer, PresidioReversibleAnonymizer
config = {
"nlp_engine_name": "spacy",
"models": [{"lang_code": "pl", "model_name": "pl_core_news_lg"}],
}
anonymizer = PresidioAnonymizer(analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
languages_config=config)
anonymizer_tool = PresidioReversibleAnonymizer(analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
languages_config=config)
text = "Jan Kowalski mieszka w Warszawie i ma e-mail [email protected]."
anonymized_result = anonymizer_tool.anonymize(text)
anon_result = anonymizer.anonymize(text)
deanonymized_result = anonymizer_tool.deanonymize(anonymized_result)
print("Anonymized text:", anonymized_result)
print("Deanonymized text:", deanonymized_result)
print("Map:", anonymizer_tool.deanonymizer_mapping)
print("Anonymized text:", anon_result)
Output:
Anonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].
Deanonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].
Map: {}
Anonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].
I expected the name "Jan Kowalski" and the email address to be anonymized, but the output remains unchanged. I have installed the pl_core_news_lg model using:
python -m spacy download pl_core_news_lg
Am I missing something in the configuration, or does Presidio not support Polish entity recognition properly? Any suggestions on how to make it detect names in Polish?
The interesting thing is that when I use only
anonymizer_tool = PresidioReversibleAnonymizer()
Then the output looks like this:
Anonymized text: Elizabeth Tate mieszka w Warszawie i ma e-mail [email protected].
Deanonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].
Map: {'PERSON': {'Elizabeth Tate': 'Jan Kowalski'}, 'EMAIL_ADDRESS': {'[email protected]': '[email protected]'}}
If I use spaCy on its own:
import spacy
nlp = spacy.load("pl_core_news_lg")
doc = nlp(text)
print("\n".join(f"{ent.text} {ent.label_}" for ent in doc.ents))  # one entity and its label per line
Then the output is correct, so I guess the problem is with Presidio itself. Output from spaCy:
Jan Kowalski persName
Warszawie placeName
So I would rather not create a custom analyzer for this. I would like to use spaCy inside Presidio, since spaCy on its own works as expected.
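Presidio lets you configure the Analyzer (and NLP) engine in code or with a YAML configuration, which makes multilingual support and entity mapping straightforward. The Polish spaCy model emits labels such as persName and placeName, so they have to be mapped to Presidio entity names such as PERSON and LOCATION: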
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
import tempfile
with tempfile.NamedTemporaryFile(delete=False) as languages_config:
    languages_config.write(b"""
nlp_engine_name: spacy
models:
  -
    lang_code: en
    model_name: en_core_web_lg
  -
    lang_code: pl
    model_name: pl_core_news_lg
ner_model_configuration:
  model_to_presidio_entity_mapping:
    persName: PERSON
    placeName: LOCATION
    orgName: ORGANIZATION
    geogName: LOCATION
    date: DATE_TIME
""")
# Create NLP engine based on configuration file
provider = NlpEngineProvider(conf_file=languages_config.name)
nlp_engine_with_polish = provider.create_engine()
# Pass created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
nlp_engine=nlp_engine_with_polish,
supported_languages=["en", "pl"]
)
# Analyze in different languages
results_polish = analyzer.analyze(text="Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].", language="pl")
print(results_polish)
results_english = analyzer.analyze(text="My name is David", language="en")
print(results_english)
Outputs:
[type: EMAIL_ADDRESS, start: 45, end: 69, score: 1.0, type: PERSON, start: 0, end: 12, score: 0.85, type: LOCATION, start: 23, end: 32, score: 0.85, type: URL, start: 58, end: 69, score: 0.5]
[type: PERSON, start: 11, end: 16, score: 0.85]
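Each result carries character offsets into the analyzed text, so the matched spans can be recovered by slicing. Here is a minimal sketch of that idea. The `Span` tuple and `extract_spans` helper are hypothetical stand-ins (Presidio's own result objects expose `entity_type`, `start`, and `end`), and the sentence is shortened to the part before the e-mail address:

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a Presidio result (entity_type, start, end)."""
    entity_type: str
    start: int
    end: int


def extract_spans(text: str, spans: list[Span]) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs, ordered by position in the text."""
    ordered = sorted(spans, key=lambda s: s.start)
    return [(s.entity_type, text[s.start:s.end]) for s in ordered]


text = "Jan Kowalski mieszka w Warszawie"
# Offsets copied from the PERSON and LOCATION results in the Polish output above.
spans = [Span("LOCATION", 23, 32), Span("PERSON", 0, 12)]
print(extract_spans(text, spans))
# [('PERSON', 'Jan Kowalski'), ('LOCATION', 'Warszawie')]
```

This is a quick way to check that the mapped labels point at the spans you expect before wiring the analyzer into anything else.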
More code samples are available in the documentation, under "Configuring the NLP engine" and "No code configuration".
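If you would rather avoid the temporary file, the same settings can probably be passed as a plain dict through the `nlp_configuration` argument of `NlpEngineProvider`. The sketch below assumes your installed Presidio version accepts `ner_model_configuration` in the dict the same way it does in YAML; `build_nlp_configuration` and `create_analyzer` are hypothetical helper names, not part of Presidio:

```python
def build_nlp_configuration() -> dict:
    """Same settings as the YAML above, expressed as a Python dict."""
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": "en", "model_name": "en_core_web_lg"},
            {"lang_code": "pl", "model_name": "pl_core_news_lg"},
        ],
        "ner_model_configuration": {
            # Polish spaCy labels on the left, Presidio entity names on the right.
            "model_to_presidio_entity_mapping": {
                "persName": "PERSON",
                "placeName": "LOCATION",
                "orgName": "ORGANIZATION",
                "geogName": "LOCATION",
                "date": "DATE_TIME",
            }
        },
    }


def create_analyzer():
    """Build an AnalyzerEngine from the dict.

    Needs presidio-analyzer and both spaCy models installed, so the
    imports are kept local to this function.
    """
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    config = build_nlp_configuration()
    provider = NlpEngineProvider(nlp_configuration=config)
    languages = [model["lang_code"] for model in config["models"]]
    return AnalyzerEngine(
        nlp_engine=provider.create_engine(),
        supported_languages=languages,
    )
```

Calling `create_analyzer()` should give you an analyzer you can use exactly like the one built from the YAML file.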
LangChain does not expose an API to override the analyzer, but you can work around that by setting the private attribute directly:
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer
anonymizer = PresidioReversibleAnonymizer(
analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD"]
)
anonymizer._analyzer = analyzer
anonymizer.anonymize("Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].")
Output:
'Jonathan Johnson mieszka w Warszawie i ma e-mail [email protected].'
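Because `_analyzer` is a private attribute, it may be renamed or removed in a later LangChain release. One way to make that failure obvious is to wrap the assignment in a small helper. This is only a sketch, and `use_custom_analyzer` is a hypothetical name, not something LangChain provides:

```python
def use_custom_analyzer(anonymizer, analyzer):
    """Swap in a custom analyzer, failing loudly if the private attribute is gone."""
    if not hasattr(anonymizer, "_analyzer"):
        raise AttributeError(
            "anonymizer has no '_analyzer' attribute; "
            "the LangChain internals may have changed"
        )
    anonymizer._analyzer = analyzer
    return anonymizer


# Usage, with the anonymizer and analyzer created as shown above:
# anonymizer = use_custom_analyzer(anonymizer, analyzer)
```

Without the check, assigning to a missing attribute would silently create a new one and the anonymizer would keep using its default analyzer.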