
Creating Rule-based matching with SpaCy and Python for detecting addresses

Tags: python, nlp, spacy

I started learning spaCy, Python's NLP library, a few days ago. I want to create rule-based matching for detecting street addresses. Here are some example addresses:

Esplanade 12
Fischerinsel 65
Esplanade 1
62 boulevard d'Alsace
80 avenue Ferdinand de Lesseps
73 avenue de Bouvines
41 Avenue des Pr'es
84 rue du Château
44 rue Sadi Carnot
Bernstrasse 324
Güntzelstrasse 6
80 Rue St Ferréol
75 rue des lieutemants Thomazo
87 cours Franklin Roosevelt
51 rue du Paillle en queue
16 Chemin Des Bateliers
65 rue Reine Elisabeth
91 rue Saint Germain
Grolmanstraße 41
Buelowstrasse 46
Waßmannsdorfer Chaussee 41
Sonnenallee 29
Gotthardstrasse 81
Augsburger Straße 65
Gotzkowskystrasse 41
Holstenwall 69
Leopoldstraße 40

So, street names are formed like this:

1st type:

<string ending with 'strasse', 'gasse' or 'platz'> + <number> (a letter can be attached to the number, for example 34a)

2nd type:

<number> + <'rue', 'avenue', 'platz', 'boulevard'> + <one or more strings>

3rd type:

<titled string> + <number>

The first two types cover 90% of cases. This is the code:

import spacy
from spacy.matcher import Matcher
from spacy import displacy

# NER is not needed for rule-based matching, so disable it when loading
nlp = spacy.load("en_core_web_trf", disable=["ner"])
pattern = ['<I do not know how to write conditions for this>']

matcher = Matcher(nlp.vocab)
matcher.add("STREET", [pattern])

text_testing1 = "I live in Güntzelstrasse 16 in Berlin"
text_testing2 = "Send that to 73 rue de Napoleon 56 in Paris"

# Run the matcher on one of the test sentences
doc = nlp(text_testing1)
result = matcher(doc)
print(result)

I do not know how to write a pattern for this kind of recognition, so I need help with that. The phrase needs to contain a number, and either one of its strings must be 'rue', 'avenue', 'platz' or 'boulevard', or one must end with "strasse" or "gasse".

taga asked Dec 30 '25 08:12


1 Answer

Here's a very simple example that matches just things like "*strasse [number]":

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [
    {"TEXT": {"REGEX": ".*strasse$"}},
    {"IS_DIGIT": True},
]
matcher.add("ADDRESS", [pattern])

doc = nlp("I live in Güntzelstrasse 16 in Berlin")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

The key part is the pattern. By changing the pattern you can make it match more things. For example, to match tokens that end in not just strasse but also platz:

pattern = [
    {"TEXT": {"REGEX": ".*(strasse|platz)$"}},
    {"IS_DIGIT": True},
]
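
To also cover house numbers with an attached letter (like 34a) and the "straße" and "gasse" endings from the question, the same two-token idea could be extended roughly like this. Treat it as a sketch under a few assumptions: it uses `spacy.blank("en")` (tokenizer only, so no model download is needed), it matches on `LOWER` so capitalization does not matter, it relies on the tokenizer keeping something like "34a" as a single token, and `find_addresses` is just a helper name made up for this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because the pattern relies only on token text, not on tags or entities.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    # Street token: lowercased text ends with one of the known suffixes.
    # REGEX is applied as a search, so no leading ".*" is needed.
    {"LOWER": {"REGEX": r"(strasse|straße|gasse|platz)$"}},
    # House number: digits, optionally followed by one letter (e.g. "34a").
    {"TEXT": {"REGEX": r"^\d+[a-zA-Z]?$"}},
]
matcher.add("ADDRESS", [pattern])


def find_addresses(text):
    """Hypothetical helper: return the matched address strings in text."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


print(find_addresses("I live in Güntzelstrasse 16 in Berlin"))
# ['Güntzelstrasse 16']
```

Note that for a two-word name like "Augsburger Straße 65" this would only pick up "Straße 65"; an optional leading token would be needed to capture the whole name.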

You can also add multiple patterns with the same label to match very different structures, like your "rue de Napoleon" example.
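
Here is a rough sketch of what adding two patterns under one label might look like. The assumptions are mine, not something from the question: a blank English pipeline, a small hand-picked list of French particles ("de", "du", ...), street names made of capitalized words, and `greedy="LONGEST"` (available in spaCy v3) so that only the longest overlapping match is kept. The names `german_style`, `french_style` and `find_addresses` are made up for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline is sufficient
matcher = Matcher(nlp.vocab)

# 1st type: "<...strasse|gasse|platz> <number>"
german_style = [
    {"LOWER": {"REGEX": r"(strasse|straße|gasse|platz)$"}},
    {"IS_DIGIT": True},
]

# 2nd type: "<number> <rue|avenue|...> [particles] <Capitalized words>"
french_style = [
    {"IS_DIGIT": True},
    {"LOWER": {"IN": ["rue", "avenue", "boulevard", "platz"]}},
    # Zero or more lowercase particles (illustrative list, not exhaustive).
    {"LOWER": {"IN": ["de", "du", "des", "la", "le"]}, "OP": "*"},
    # One or more capitalized name tokens.
    {"IS_TITLE": True, "OP": "+"},
]

# Both patterns share the label; LONGEST keeps only the longest of any
# overlapping matches (e.g. "44 rue Sadi Carnot" over "44 rue Sadi").
matcher.add("ADDRESS", [german_style, french_style], greedy="LONGEST")


def find_addresses(text):
    """Hypothetical helper: matched address strings in document order."""
    doc = nlp(text)
    # Greedy filtering does not return matches by position, so sort by start.
    matches = sorted(matcher(doc), key=lambda m: m[1])
    return [doc[start:end].text for _, start, end in matches]


print(find_addresses("Send that to 73 rue de Napoleon 56 in Paris"))
# ['73 rue de Napoleon']
```

This will still miss or truncate some of the listed examples (for instance names with apostrophes or lowercase words in the middle), so the patterns would need tuning against real data.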

The Matcher has a lot of features. I really recommend reading through the docs and trying them all out once.

polm23 answered Jan 01 '26 00:01