
Correct POS tags for numbers substituted with ## in spacy

The Gigaword dataset is a huge corpus used to train abstractive summarization models. It contains summaries like these:

spain 's colonial posts #.## billion euro loss
taiwan shares close down #.## percent

I want to process these summaries with spaCy and get the correct POS tag for each token. The problem is that all numbers in the dataset were replaced with # signs, which spaCy does not classify as numbers (NUM) but as other tags.

>>> import spacy
>>> from spacy.tokens import Doc
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.tokenizer = lambda raw: Doc(nlp.vocab, words=raw.split(' '))
>>> text = "spain 's colonial posts #.## billion euro loss"
>>> doc = nlp(text)
>>> [(token.text, token.pos_) for token in doc]
[('spain', 'PROPN'), ("'s", 'PART'), ('colonial', 'ADJ'), ('posts', 'NOUN'), ('#.##', 'PROPN'), ('billion', 'NUM'), ('euro', 'PROPN'), ('loss', 'NOUN')]

Is there a way to customize the POS tagger so that it classifies all tokens consisting only of # signs and dots as numbers?

I know I could replace the spaCy POS tagger with my own or fine-tune it for my domain with additional data, but I don't have tagged training data in which all numbers are replaced with #, and I would like to change the tagger as little as possible. I would prefer a regular expression or a fixed list of tokens that are always recognized as numbers.
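For illustration, the kind of rule I have in mind might look like this sketch (assuming spaCy v3, where an attribute_ruler component ships with en_core_web_sm; the regex is just my guess at the token shapes):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = lambda raw: Doc(nlp.vocab, words=raw.split(' '))

# Force POS NUM on every token made up of #s, optionally with dots
ruler = nlp.get_pipe("attribute_ruler")
ruler.add(
    patterns=[[{"TEXT": {"REGEX": r"^#+(\.#+)*$"}}]],
    attrs={"POS": "NUM"},
)

doc = nlp("spain 's colonial posts #.## billion euro loss")
print([(token.text, token.pos_) for token in doc])
# '#.##' should now come out as NUM instead of PROPN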

Pyfisch asked Feb 10 '20




1 Answer

What about replacing # with a digit?

In a first version of this answer I chose the digit 9, because it reminds me of the COBOL numeric field formats I used some 30 years ago... But then I had a look at the dataset, and realized that for proper NLP processing one should get at least a couple of things straight:

  • ordinal numerals (1st, 2nd, ...)
  • dates

Ordinal numerals need special handling for any choice of digit, but the digit 1 produces reasonable dates, except for the year (of course, 1111 may or may not be interpreted as a valid year, but let's play it safe). 11/11/2020 is clearly better than 99/99/9999...

Here is the code:

import re

ic = re.IGNORECASE
subs = [
    (re.compile(r'\b1(nd)\b', flags=ic), r'2\1'),  # 1nd -> 2nd
    (re.compile(r'\b1(rd)\b', flags=ic), r'3\1'),  # 1rd -> 3rd
    (re.compile(r'\b1(th)\b', flags=ic), r'4\1'),  # 1th -> 4th
    (re.compile(r'11(st)\b', flags=ic), r'21\1'),  # ...11st -> ...21st
    (re.compile(r'11(nd)\b', flags=ic), r'22\1'),  # ...11nd -> ...22nd
    (re.compile(r'11(rd)\b', flags=ic), r'23\1'),  # ...11rd -> ...23rd
    (re.compile(r'\b1111\b'), '2020')              # 1111 -> 2020
]

text = '''spain 's colonial posts #.## billion euro loss
#nd, #rd, #th, ##st, ##nd, ##RD, ##TH, ###st, ###nd, ###rd, ###th.
ID=#nd#### year=#### OK'''

text = text.replace('#', '1')
for pattern, repl in subs:
    text = re.sub(pattern, repl, text)

print(text)
# spain 's colonial posts 1.11 billion euro loss
# 2nd, 3rd, 4th, 21st, 22nd, 23RD, 11TH, 121st, 122nd, 123rd, 111th.
# ID=1nd1111 year=2020 OK
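Feeding the transformed text back into spaCy (with the same model and whitespace tokenizer as in your question) should then yield the desired tags; a quick check:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = lambda raw: Doc(nlp.vocab, words=raw.split(' '))

doc = nlp("spain 's colonial posts 1.11 billion euro loss")
print([(token.text, token.pos_) for token in doc])
# '1.11' should now be tagged NUM rather than PROPN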

If the preprocessing of the corpus converts any digit into a # anyway, you lose no information with this transformation. Some “true” # would become a 1, but this would probably be a minor problem compared to numbers not being recognized as such. Furthermore, in a visual inspection of about 500000 lines of the dataset I haven't been able to find any candidate for a “true” #.

N.B.: The \b in the above regular expressions stands for a “word boundary”, i.e., the boundary between a \w (word) and a \W (non-word) character, where a word character is any alphanumeric character or the underscore. The \1 in the replacement refers to the first group, i.e., the first pair of parentheses. Using \1 preserves the case of the matched text, which would not be possible with a fixed replacement string like 2nd. I later found that your dataset is normalized to all lower case, but I decided to keep the code generic.
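A quick stdlib demo of the case-preserving backreference:

import re
# \1 reinserts the captured suffix unchanged, so its case survives:
print(re.sub(r'\b1(nd)\b', r'2\1', '1nd vs 1ND', flags=re.IGNORECASE))
# -> 2nd vs 2ND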

If you need to get the text with #s back from the tagged tokens, it's simply

re.sub(r'\d', '#', token.text)  # e.g. '1.11' -> '#.##'
Walter Tross answered Sep 30 '22