I'm trying to use spaCy to tokenize a text document in which named entities are wrapped in TEI-like XML tags, e.g. <personName>Harry</personName> goes to <orgName>Hogwarts</orgName>.
import spacy
nlp = spacy.load('en')
txt = '<personName>Harry</personName> goes to <orgName>Hogwarts</orgName>. <personName>Sally</personName> lives in <locationName>London</locationName>.'
doc = nlp(txt)
sents = list(doc.sents)
for i, s in enumerate(doc.sents):
    print("{}: {}".format(i, s))
However, the XML tags throw off the sentence splitting:
0: <personName>
1: Harry</personName> goes to <orgName>
2: Hogwarts</orgName>.
3: <personName>
4: Sally</personName> lives in <
5: locationName>
6: London</locationName>.
How can I get only 2 sentences? I know that spaCy has support for a custom tokenizer, but since the rest of the text is standard, I'd like to keep using the built-in one, or perhaps build on top of it so that it recognizes the XML annotations.
I've managed to do it by counting the tokens and keeping track of which annotations each token has. It's a bit convoluted, but it does the job.
Preparation:
import re
import spacy

pattern = re.compile('</?[a-zA-Z_]+>')
pattern_start = re.compile('<[a-zA-Z_]+>')
pattern_end = re.compile('</[a-zA-Z_]+>')

# xml is a tag matching `pattern` above, e.g. '<personName>' or '</personName>'
def annotate(xml):
    if xml[1] == '/':
        return xml[2:-1] + '-end'
    else:
        return xml[1:-1] + '-start'
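For example (not part of the original gist, just to show what annotate produces):
print(annotate('<personName>'))   # personName-start
print(annotate('</personName>'))  # personName-end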
nlp = spacy.load('en')
txt = '<personName>Harry Potter</personName> goes to \
<orgName>Hogwarts</orgName>. <personName>Sally</personName> \
lives in #<locationName>London</locationName>.'
words = txt.split()
stripped_words = []
# A mapping between token index and its annotations
annotations = {}
all_tokens = []
# Indices into stripped_words that should NOT be preceded by a space
no_space = {}
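For the sample text above, txt.split() keeps the XML tags glued to their neighbouring words, which is exactly what the loop below has to untangle:
print(words)
# ['<personName>Harry', 'Potter</personName>', 'goes', 'to',
#  '<orgName>Hogwarts</orgName>.', '<personName>Sally</personName>',
#  'lives', 'in', '#<locationName>London</locationName>.']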
Now let's traverse the words and check for annotations. We'll split each one into three parts: prefix, tag (the annotated text itself), and suffix. E.g. for @<orgName>Hogwarts</orgName>. they'll be @, Hogwarts, and ., respectively.
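To illustrate, this is how the regexes above carve up such a word (shown here only as an aside, the loop below does the same thing):
w = '@<orgName>Hogwarts</orgName>.'
print(re.findall(pattern, w))                        # ['<orgName>', '</orgName>']
print(re.split(pattern_start, w))                    # ['@', 'Hogwarts</orgName>.']
print(re.split(pattern_end, 'Hogwarts</orgName>.'))  # ['Hogwarts', '.']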
for i, w in enumerate(words):
    matches = re.findall(pattern, w)
    w_annotations = []
    if len(matches) > 0:
        for m in matches:
            w_annotations.append(annotate(m))
        splitted_start = re.split(pattern_start, w)
        # TODO: we assume no word contains more than one annotation
        if len(splitted_start) > 1:
            prefix, rest = splitted_start
            if len(prefix) > 0:
                tokens = list(nlp(prefix))
                all_tokens.extend(tokens)
                # The prefix needs a space before it, but the tagged word that follows does not
                no_space[len(stripped_words) + 1] = True
                stripped_words.append(prefix)
        else:
            rest = splitted_start[0]
        splitted_end = re.split(pattern_end, rest)
        tag = splitted_end[0]
        stripped_words.append(tag)
        tokens = list(nlp(tag))
        n_tokens = len(all_tokens)
        for j, t in enumerate(tokens):
            annotations[n_tokens + j] = w_annotations
        all_tokens.extend(tokens)
        if len(splitted_end) > 1:
            suffix = splitted_end[1]
            if len(suffix) > 0:
                tokens = list(nlp(suffix))
                all_tokens.extend(tokens)
                # The suffix attaches directly to the tagged word, with no space
                no_space[len(stripped_words)] = True
                stripped_words.append(suffix)
    else:
        stripped_words.append(w)
        tokens = list(nlp(w))
        all_tokens.extend(tokens)
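With the sample text above, the bookkeeping structures end up like this (a trace of the loop, shown only for illustration):
print(stripped_words)
# ['Harry', 'Potter', 'goes', 'to', 'Hogwarts', '.', 'Sally', 'lives', 'in', '#', 'London', '.']
print(no_space)
# {5: True, 10: True, 11: True}  (the '.', 'London' and final '.' attach directly to the previous piece)
print(annotations)
# {0: ['personName-start'], 1: ['personName-end'],
#  4: ['orgName-start', 'orgName-end'],
#  6: ['personName-start', 'personName-end'],
#  10: ['locationName-start', 'locationName-end']}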
Finally, let's print the sentences with their annotations:
stripped_txt = stripped_words[0]
for i, w in enumerate(stripped_words[1:]):
    if (i + 1) in no_space:
        stripped_txt += w
    else:
        stripped_txt += ' ' + w

doc = nlp(stripped_txt)
n_tokens = 0
for i, s in enumerate(doc.sents):
    print("sentence{}: {}".format(i, s))
    for j, t in enumerate(list(s)):
        if n_tokens in annotations:
            anons = annotations[n_tokens]
        else:
            anons = []
        print("\t token{}: {}, annotations: {}".format(n_tokens, t, anons))
        n_tokens += 1
Result:
sentence0: Harry Potter goes to Hogwarts.
    token0: Harry, annotations: ['personName-start']
    token1: Potter, annotations: ['personName-end']
    token2: goes, annotations: []
    token3: to, annotations: []
    token4: Hogwarts, annotations: ['orgName-start', 'orgName-end']
    token5: ., annotations: []
sentence1: Sally lives in #London.
    token6: Sally, annotations: ['personName-start', 'personName-end']
    token7: lives, annotations: []
    token8: in, annotations: []
    token9: #, annotations: []
    token10: London, annotations: ['locationName-start', 'locationName-end']
    token11: ., annotations: []
Full Code: https://gist.github.com/dimidd/1aba8b57643d5936f42670f0c5f344e4