
Presidio with Langchain Experimental does not detect Polish names

I am using presidio/langchain_experimental to anonymize text in Polish, but it does not detect names (e.g., "Jan Kowalski"). Here is my code:

from langchain_experimental.data_anonymizer import (
    PresidioAnonymizer,
    PresidioReversibleAnonymizer,
)

config = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "pl", "model_name": "pl_core_news_lg"}],
}

anonymizer = PresidioAnonymizer(analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
                                languages_config=config)

anonymizer_tool = PresidioReversibleAnonymizer(analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
                                               languages_config=config)

text = "Jan Kowalski mieszka w Warszawie i ma e-mail [email protected]."

anonymized_result = anonymizer_tool.anonymize(text)
anon_result = anonymizer.anonymize(text)
deanonymized_result = anonymizer_tool.deanonymize(anonymized_result)

print("Anonymized text:", anonymized_result)
print("Deanonymized text:", deanonymized_result)
print("Map:", anonymizer_tool.deanonymizer_mapping)
print("Anonymized text:", anon_result)

Output:

Anonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].
Deanonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].
Map: {}
Anonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].

I expected the name "Jan Kowalski" and the email address to be anonymized, but the output remains unchanged. I have installed the pl_core_news_lg model using:

python -m spacy download pl_core_news_lg

Am I missing something in the configuration, or does Presidio not support Polish entity recognition properly? Any suggestions on how to make it detect names in Polish?

The interesting thing is that when I use only

anonymizer_tool = PresidioReversibleAnonymizer()

Then the output looks like this:

Anonymized text: Elizabeth Tate mieszka w Warszawie i ma e-mail [email protected]. 
Deanonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail [email protected]. 
Map: {'PERSON': {'Elizabeth Tate': 'Jan Kowalski'}, 'EMAIL_ADDRESS': {'[email protected]': '[email protected]'}}

Also, if I use spaCy on its own:

nlp = spacy.load("pl_core_news_lg")
doc = nlp(text)

Then the entities are recognized correctly, so I suspect the problem lies in Presidio itself. Output from spaCy:

Jan Kowalski persName
Warszawie placeName

So I would rather not write a custom analyzer for this. I want Presidio to use spaCy, since spaCy on its own works as expected.

asked Dec 06 '25 06:12 by Maltion


1 Answer

Presidio lets you configure the Analyzer (and NLP) engine in code or through a YAML file, which makes it straightforward to add languages and to map a model's entity labels (such as persName) to Presidio entities (such as PERSON):

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

import tempfile

with tempfile.NamedTemporaryFile(delete=False) as languages_config:
    languages_config.write(b"""
nlp_engine_name: spacy
models:
-
    lang_code: en
    model_name: en_core_web_lg
-
    lang_code: pl
    model_name: pl_core_news_lg

ner_model_configuration:
    model_to_presidio_entity_mapping:
        persName: PERSON
        placeName: LOCATION
        orgName: ORGANIZATION
        geogName: LOCATION
        date: DATE_TIME
"""
    )


# Create NLP engine based on configuration file
provider = NlpEngineProvider(conf_file=languages_config.name)
nlp_engine_with_polish = provider.create_engine()

# Pass created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_polish,
    supported_languages=["en", "pl"]
)

# Analyze in different languages
results_polish = analyzer.analyze(text="Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].", language="pl")
print(results_polish)

results_english = analyzer.analyze(text="My name is David", language="en")
print(results_english)

Outputs:

[type: EMAIL_ADDRESS, start: 45, end: 69, score: 1.0, type: PERSON, start: 0, end: 12, score: 0.85, type: LOCATION, start: 23, end: 32, score: 0.85, type: URL, start: 58, end: 69, score: 0.5]
[type: PERSON, start: 11, end: 16, score: 0.85]
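Each result carries character offsets into the analyzed text, so the detected values can be recovered by slicing. Below is a minimal sketch using the PERSON and LOCATION offsets from the Polish output above. The Span type and the extract helper are made up for illustration and are not part of Presidio, although Presidio's result objects expose the same entity_type, start, end and score fields.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for one analyzer result (not a Presidio class)."""

    entity_type: str
    start: int
    end: int
    score: float


def extract(text: str, spans: list[Span]) -> dict[str, str]:
    """Map each entity type to the substring its offsets point at."""
    return {s.entity_type: text[s.start:s.end] for s in spans}


# Only the part of the sentence before the e-mail address is used here.
text = "Jan Kowalski mieszka w Warszawie"
spans = [Span("PERSON", 0, 12, 0.85), Span("LOCATION", 23, 32, 0.85)]

found = extract(text, spans)
print(found)
```

The EMAIL_ADDRESS and URL results in the real output overlap because the URL recognizer also matches the domain part of the address.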

More code samples from the documentation: Configuring The NLP engine, No code configuration
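The temporary YAML file is not strictly required. As far as I know, NlpEngineProvider also accepts the same settings as a plain dict through its nlp_configuration argument, at least in recent Presidio versions. Here is a sketch of that variant, reusing the models and label mapping from the YAML above. The helper names build_nlp_configuration and create_analyzer are mine, and the Presidio imports sit inside the function so the dict can be inspected without loading any models.

```python
def build_nlp_configuration() -> dict:
    """Return the YAML settings from the answer as a plain dict."""
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": "en", "model_name": "en_core_web_lg"},
            {"lang_code": "pl", "model_name": "pl_core_news_lg"},
        ],
        "ner_model_configuration": {
            # Labels emitted by pl_core_news_lg -> Presidio entity names
            "model_to_presidio_entity_mapping": {
                "persName": "PERSON",
                "placeName": "LOCATION",
                "orgName": "ORGANIZATION",
                "geogName": "LOCATION",
                "date": "DATE_TIME",
            }
        },
    }


def create_analyzer(configuration: dict):
    """Build an AnalyzerEngine from the dict (needs Presidio and the spaCy models)."""
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(nlp_configuration=configuration)
    languages = [model["lang_code"] for model in configuration["models"]]
    return AnalyzerEngine(
        nlp_engine=provider.create_engine(),
        supported_languages=languages,
    )


configuration = build_nlp_configuration()
languages = [model["lang_code"] for model in configuration["models"]]
print(languages)
```

Calling create_analyzer(configuration) should then give an analyzer equivalent to the one built from the file, assuming both spaCy models are installed.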

LangChain does not expose an API to override the analyzer, but you can work around that by assigning to the private _analyzer attribute:

from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

anonymizer = PresidioReversibleAnonymizer(
    analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD"]
)
anonymizer._analyzer = analyzer
anonymizer.anonymize("Jan Kowalski mieszka w Warszawie i ma e-mail [email protected].")

Output:

'Jonathan Johnson mieszka w Warszawie i ma e-mail [email protected].'
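Since _analyzer is a private attribute, it may be renamed in a later release, so it can help to wrap the assignment in a small guard that fails loudly. The sketch below is my own helper, not a LangChain API. It also rests on one assumption from my reading of langchain_experimental that you should verify against your installed version: the anonymizer keeps its language list in a supported_languages attribute and falls back to the first entry when anonymize is called without a language. The _FakeAnonymizer class exists only so the example runs without LangChain installed.

```python
def attach_analyzer(anonymizer, analyzer, languages=None):
    """Swap a custom analyzer into a LangChain Presidio anonymizer.

    Relies on the private `_analyzer` attribute used in the answer, so it
    raises if that attribute is missing instead of silently doing nothing.
    """
    if not hasattr(anonymizer, "_analyzer"):
        raise AttributeError(
            "anonymizer has no _analyzer attribute; library internals may have changed"
        )
    anonymizer._analyzer = analyzer
    # Assumption: the language list lives in `supported_languages`.
    if languages is not None and hasattr(anonymizer, "supported_languages"):
        anonymizer.supported_languages = list(languages)
    return anonymizer


class _FakeAnonymizer:
    """Hypothetical stand-in with the two attributes the helper touches."""

    def __init__(self):
        self._analyzer = None
        self.supported_languages = ["en"]


demo = attach_analyzer(_FakeAnonymizer(), analyzer="custom", languages=["pl", "en"])
print(demo.supported_languages)
```

With the real objects this would be attach_analyzer(anonymizer, analyzer, languages=["pl", "en"]), followed by anonymizer.anonymize(text, language="pl") so that the Polish model is the one actually used.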
answered Dec 08 '25 20:12 by Sharon Hart


