 

How can spaCy tokenize a hashtag as a whole?

In a sentence containing hashtags, such as a tweet, spaCy's tokenizer splits hashtags into two tokens:

import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]

output:

[This, is, a, #, sentence, .]

I'd like hashtags to be tokenized as follows. Is that possible?

[This, is, a, #sentence, .]
jmague asked Apr 13 '17 09:04

People also ask

How do you Tokenize with spaCy?

In spaCy, tokenizing a text into segments of words and punctuation is done in several steps, processing the text from left to right. First, the tokenizer splits the text on whitespace, similar to the split() function. Then it checks whether each substring matches a tokenizer exception rule.
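To make the two steps above concrete, here is a toy illustration in plain Python. This is not spaCy's real implementation, and the exception table is made up; it only sketches the idea of "whitespace split, then exception lookup, then generic punctuation splitting":

```python
import re

# Hypothetical exception table: substrings that get a fixed, hand-written split.
EXCEPTIONS = {"don't": ["do", "n't"], "U.S.": ["U.S."]}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():            # step 1: split on whitespace
        if chunk in EXCEPTIONS:           # step 2: check exception rules
            tokens.extend(EXCEPTIONS[chunk])
        else:
            # fallback: separate word characters from punctuation
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(toy_tokenize("I don't like it."))
# → ['I', 'do', "n't", 'like', 'it', '.']
```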

Does spaCy automatically Tokenize?

spaCy automatically breaks your document into tokens when a document is created using the model.

What does NLP () do in spaCy?

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
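A minimal way to see the pipeline idea, assuming a recent spaCy (v3, where components are added by string name): a blank English pipeline starts with only the tokenizer, and a lightweight rule-based component like the sentencizer can be added without any trained model.

```python
import spacy

# A blank pipeline has a tokenizer but no components yet.
nlp = spacy.blank("en")
# "sentencizer" is a built-in rule-based sentence-boundary component.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)  # → ['sentencizer']

doc = nlp("This is one sentence. This is another.")
print([s.text for s in doc.sents])
```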

How does NLP tokenizer work?

Tokenization breaks raw text into small chunks, such as words or sentences, called tokens. These tokens help in understanding the context or in developing the model for NLP: analyzing the sequence of words helps interpret the meaning of the text.


2 Answers

I also tried several ways to prevent spaCy from splitting hashtags or hyphenated words like "cutting-edge". My experience is that merging tokens afterwards can be problematic, because the POS tagger and dependency parser have already used the wrong tokens for their decisions. Touching the infix, prefix, and suffix regexes is error-prone and complex, because you don't want your changes to produce side effects.

The simplest way is indeed, as pointed out before, to modify the token_match function of the tokenizer. This is a re.match identifying regular expressions that will not be split. Instead of importing the specific URL pattern, I'd rather extend whatever spaCy's default is.

import re

import spacy
from spacy.tokenizer import _get_regex_pattern

nlp = spacy.load('en')

# get default pattern for tokens that don't get split
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
# add your patterns (here: hashtags and in-word hyphens)
re_token_match = rf"({re_token_match}|#\w+|\w+-\w+)"

# overwrite token_match function of the tokenizer
nlp.tokenizer.token_match = re.compile(re_token_match).match

text = "@Pete: choose low-carb #food #eatsmart ;-) 😋👍"
doc = nlp(text)

This yields:

['@Pete', ':', 'choose', 'low-carb', '#food', '#eatsmart', ';-)', '😋', '👍']
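The two added alternatives can be checked in isolation with plain re, without loading spaCy at all (the full token_match pattern additionally contains whatever spaCy's default matches, e.g. URLs):

```python
import re

# Just the two patterns added above: hashtags and in-word hyphens.
pattern = re.compile(r"#\w+|\w+-\w+")

assert pattern.fullmatch("#food")
assert pattern.fullmatch("low-carb")
assert pattern.fullmatch("sentence") is None  # ordinary words fall through
```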
Jens answered Oct 05 '22 14:10


This is more of an add-on to the great answer by @DhruvPathak AND a shameless copy from the GitHub thread linked below (with an even better answer by @csvance). spaCy features (since v2.0) the add_pipe method, meaning you can wrap @DhruvPathak's great answer in a function and add that step (conveniently) into your nlp processing pipeline, as below.

Citation starts here:

import spacy

# note: doc.merge and passing a bare function to add_pipe are spaCy v2 APIs;
# v3 replaced them with Doc.retokenize and named, registered components
def hashtag_pipe(doc):
    merged_hashtag = False
    while True:
        for token_index,token in enumerate(doc):
            if token.text == '#':
                if token.head is not None:
                    start_index = token.idx
                    end_index = start_index + len(token.head.text) + 1
                    if doc.merge(start_index, end_index) is not None:
                        merged_hashtag = True
                        break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc

nlp = spacy.load('en')
nlp.add_pipe(hashtag_pipe)

doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'

Citation ends here; Check out how to add hashtags to the part of speech tagger #503 for the full thread.

PS: It's clear when reading the code, but for the copy&pasters: don't disable the parser :)
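Since the cited snippet relies on doc.merge, which was removed in spaCy v3, here is a sketch of the same merge step using the v3 retokenizer API. A blank pipeline is used so no trained model is required; in a real pipeline you would register this as a named component.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("twitter #hashtag")  # tokenized as ['twitter', '#', 'hashtag']

# Collect merges inside the context manager; they are applied on exit,
# so token indices stay stable while we iterate.
with doc.retokenize() as retokenizer:
    i = 0
    while i < len(doc) - 1:
        if doc[i].text == "#":
            retokenizer.merge(doc[i : i + 2])  # merge '#' with the next token
            i += 2
        else:
            i += 1

print([t.text for t in doc])  # → ['twitter', '#hashtag']
```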

Moritz answered Oct 05 '22 14:10