
Correct POS tags for numbers substituted with ## in spacy

The Gigaword dataset is a huge corpus used to train abstractive summarization models. It contains summaries like these:

spain 's colonial posts #.## billion euro loss
taiwan shares close down #.## percent

I want to process these summaries with spaCy and get the correct POS tag for each token. The problem is that all numbers in the dataset were replaced with # signs, which spaCy does not classify as numbers (NUM) but as other tags.

>>> import spacy
>>> from spacy.tokens import Doc
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.tokenizer = lambda raw: Doc(nlp.vocab, words=raw.split(' '))
>>> text = "spain 's colonial posts #.## billion euro loss"
>>> doc = nlp(text)
>>> [(token.text, token.pos_) for token in doc]
[('spain', 'PROPN'), ("'s", 'PART'), ('colonial', 'ADJ'), ('posts', 'NOUN'), ('#.##', 'PROPN'), ('billion', 'NUM'), ('euro', 'PROPN'), ('loss', 'NOUN')]

Is there a way to customize the POS tagger so that it classifies all tokens consisting only of # signs and dots as numbers?

I know I could replace the spaCy POS tagger with my own or fine-tune it for my domain with additional data, but I don't have tagged training data in which all numbers are replaced with #, and I would like to change the tagger as little as possible. I would prefer a regular expression or a fixed list of tokens that are always recognized as numbers.
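For illustration, the kind of rule I have in mind might look like this sketch (assuming spaCy v3, where an attribute_ruler component ships with en_core_web_sm; the regex is just my guess at the token shapes):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = lambda raw: Doc(nlp.vocab, words=raw.split(' '))

# Force POS NUM on every token made up of #s, optionally with dots
ruler = nlp.get_pipe("attribute_ruler")
ruler.add(
    patterns=[[{"TEXT": {"REGEX": r"^#+(\.#+)*$"}}]],
    attrs={"POS": "NUM"},
)

doc = nlp("spain 's colonial posts #.## billion euro loss")
print([(token.text, token.pos_) for token in doc])
# '#.##' should now come out as NUM instead of PROPN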

Pyfisch asked Feb 10 '20




1 Answer

What about replacing # with a digit?

In a first version of this answer I chose the digit 9, because it reminds me of the COBOL numeric field formats I used some 30 years ago... But then I had a look at the dataset, and realized that for proper NLP processing one should get at least a couple of things straight:

  • ordinal numerals (1st, 2nd, ...)
  • dates

Ordinal numerals need special handling for any choice of digit, but the digit 1 produces reasonable dates, except for the year (of course, 1111 may or may not be interpreted as a valid year, but let's play it safe). 11/11/2020 is clearly better than 99/99/9999...

Here is the code:

import re

ic = re.IGNORECASE
subs = [
    (re.compile(r'\b1(nd)\b', flags=ic), r'2\1'),  # 1nd -> 2nd
    (re.compile(r'\b1(rd)\b', flags=ic), r'3\1'),  # 1rd -> 3rd
    (re.compile(r'\b1(th)\b', flags=ic), r'4\1'),  # 1th -> 4th
    (re.compile(r'11(st)\b', flags=ic), r'21\1'),  # ...11st -> ...21st
    (re.compile(r'11(nd)\b', flags=ic), r'22\1'),  # ...11nd -> ...22nd
    (re.compile(r'11(rd)\b', flags=ic), r'23\1'),  # ...11rd -> ...23rd
    (re.compile(r'\b1111\b'), '2020')              # 1111 -> 2020
]

text = '''spain 's colonial posts #.## billion euro loss
#nd, #rd, #th, ##st, ##nd, ##RD, ##TH, ###st, ###nd, ###rd, ###th.
ID=#nd#### year=#### OK'''

text = text.replace('#', '1')
for pattern, repl in subs:
    text = re.sub(pattern, repl, text)

print(text)
# spain 's colonial posts 1.11 billion euro loss
# 2nd, 3rd, 4th, 21st, 22nd, 23RD, 11TH, 121st, 122nd, 123rd, 111th.
# ID=1nd1111 year=2020 OK
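Feeding the transformed text back into spaCy (with the same model and whitespace tokenizer as in your question) should then yield the desired tags; a quick check:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = lambda raw: Doc(nlp.vocab, words=raw.split(' '))

doc = nlp("spain 's colonial posts 1.11 billion euro loss")
print([(token.text, token.pos_) for token in doc])
# '1.11' should now be tagged NUM rather than PROPN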

If the preprocessing of the corpus converts any digit into a # anyway, you lose no information with this transformation. Some “true” # would become a 1, but this would probably be a minor problem compared to numbers not being recognized as such. Furthermore, in a visual inspection of about 500000 lines of the dataset I haven't been able to find any candidate for a “true” #.

N.B.: The \b in the above regular expressions stands for a “word boundary”, i.e., the boundary between a \w (word) and a \W (non-word) character, where a word character is any alphanumeric character or the underscore. The \1 in the replacement refers to the first group, i.e., the first pair of parentheses. Using \1 preserves the case of the matched text, which would not be possible with a fixed replacement string like 2nd. I later found that your dataset is normalized to all lower case, but I decided to keep the code generic.
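A quick stdlib demo of the case-preserving backreference:

import re
# \1 reinserts the captured suffix unchanged, so its case survives:
print(re.sub(r'\b1(nd)\b', r'2\1', '1nd vs 1ND', flags=re.IGNORECASE))
# -> 2nd vs 2ND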

If you need to get the text with #s back from the tagged tokens, it's simply

re.sub(r'\d', '#', token.text)  # e.g. '1.11' -> '#.##'
Walter Tross answered Sep 30 '22