In a sentence containing hashtags, such as a tweet, spaCy's tokenizer splits hashtags into two tokens:
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]
output:
[This, is, a, #, sentence, .]
I'd like to have hashtags tokenized as follows, is that possible?
[This, is, a, #sentence, .]
In spaCy, the process of tokenizing a text into segments of words and punctuation is done in several steps. It processes the text from left to right. First, the tokenizer splits the text on whitespace, similar to the split() function. Then it checks whether each substring matches a tokenizer exception rule and whether a prefix, suffix or infix can be split off.
spaCy automatically breaks your document into tokens when a Doc is created using the model.
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
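For illustration, here is a minimal way to inspect which components a loaded pipeline actually contains (nlp.pipe_names is a standard spaCy attribute; the exact components listed in the comment are indicative and vary by model and spaCy version):
import spacy
nlp = spacy.load('en')
print(nlp.pipe_names)
# e.g. ['tagger', 'parser', 'ner'] (components depend on the model and spaCy version)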
Tokenization is breaking the raw text into small chunks. It splits the raw text into words and sentence pieces, which are called tokens. These tokens help in understanding the context and in developing models for NLP; tokenization helps in interpreting the meaning of the text by analyzing the sequence of words.
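To see these steps in action, a minimal sketch using tokenizer.explain (available from spaCy v2.2 on); the rule names in the comment are indicative:
import spacy
nlp = spacy.load('en')
for rule, token_text in nlp.tokenizer.explain(u'This is a #sentence.'):
    print(rule, token_text)
# roughly: TOKEN This / TOKEN is / TOKEN a / PREFIX # / TOKEN sentence / SUFFIX .
# i.e. the '#' is split off by a prefix rule, which is why it ends up as its own token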
I also tried several ways to prevent spaCy from splitting hashtags or hyphenated words like "cutting-edge". My experience is that merging tokens afterwards can be problematic, because the POS tagger and dependency parser have already based their decisions on the wrong tokens. Touching the infix, prefix and suffix regexes is also error-prone and complex, because you don't want your changes to produce side effects.
The simplest way is indeed, as pointed out before, to modify the token_match function of the tokenizer. This is a re.match that identifies regular expressions which should never be split. Instead of importing the specific URL pattern, I'd rather extend whatever spaCy's default is.
import re
import spacy
from spacy.tokenizer import _get_regex_pattern

nlp = spacy.load('en')

# get the default pattern for tokens that don't get split
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
# add your patterns (here: hashtags and in-word hyphens)
re_token_match = rf"({re_token_match}|#\w+|\w+-\w+)"

# overwrite the token_match function of the tokenizer
nlp.tokenizer.token_match = re.compile(re_token_match).match
text = "@Pete: choose low-carb #food #eatsmart ;-) 😋👍"
doc = nlp(text)
This yields:
['@Pete', ':', 'choose', 'low-carb', '#food', '#eatsmart', ';-)', '😋', '👍']
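As a quick sanity check (the exact tags depend on the trained model, so the output below is only indicative), the merged tokens now also carry their own part-of-speech and dependency information, since the tagger and parser never saw the split-up pieces:
print([(t.text, t.pos_) for t in doc])
# e.g. ('#food', 'NOUN') appears as one token rather than separate '#' and 'food' tokens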
This is more of an add-on to the great answer by @DhruvPathak, and a shameless copy from the GitHub thread linked below (and the even better answer there by @csvance). spaCy features (since v2.0) the add_pipe method, meaning you can wrap @DhruvPathak's great answer in a function and add that step (conveniently) into your nlp processing pipeline, as below.
Citation starts here:
def hashtag_pipe(doc):
    merged_hashtag = False
    while True:
        for token_index, token in enumerate(doc):
            if token.text == '#':
                if token.head is not None:
                    start_index = token.idx
                    end_index = start_index + len(token.head.text) + 1
                    if doc.merge(start_index, end_index) is not None:
                        merged_hashtag = True
                        break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc
nlp = spacy.load('en')
nlp.add_pipe(hashtag_pipe)
doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'
Citation ends here; see the GitHub issue "how to add hashtags to the part of speech tagger #503" for the full thread.
PS It's clear when reading the code, but for the copy&pasters, don't disable the parser :)
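Note that doc.merge and nlp.add_pipe(some_function) were removed in spaCy v3. For completeness, here is a minimal sketch of the same idea with the v3 API, using doc.retokenize and the Language.component decorator; the component name hashtag_merger and the model name are my own choices, not part of the original answer:
import spacy
from spacy.language import Language

@Language.component("hashtag_merger")
def hashtag_merger(doc):
    # collect each '#' token that is glued to the following token (no whitespace in between)
    spans = []
    for token in doc[:-1]:
        if token.text == "#" and token.whitespace_ == "":
            spans.append(doc[token.i : token.i + 2])
    # merge every such pair into a single token
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

nlp = spacy.load("en_core_web_sm")  # assumed v3 model name
nlp.add_pipe("hashtag_merger", first=True)  # run before tagger/parser so they see the merged token
doc = nlp("twitter #hashtag")
assert [t.text for t in doc] == ["twitter", "#hashtag"]
Running the component first means the downstream tagger and parser only ever see the merged hashtag token, which avoids the "wrong tokens for their decisions" problem mentioned above.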