In a sentence containing hashtags, such as a tweet, spaCy's tokenizer splits hashtags into two tokens:
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]
output:
[This, is, a, #, sentence, .]
I'd like to have hashtags tokenized as follows, is that possible?
[This, is, a, #sentence, .]
In spaCy, the process of tokenizing a text into segments of words and punctuation is done in several steps. It processes the text from left to right. First, the tokenizer splits the text on whitespace, similar to the split() function. Then it checks whether each substring matches a tokenizer exception rule and whether a prefix, suffix or infix can be split off.
spaCy automatically breaks your document into tokens when a Doc is created using the model.
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
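For illustration, here is a minimal way to inspect which components a loaded pipeline actually contains (nlp.pipe_names is a standard spaCy attribute; the exact components listed in the comment are indicative and vary by model and spaCy version):
import spacy
nlp = spacy.load('en')
print(nlp.pipe_names)
# e.g. ['tagger', 'parser', 'ner'] (components depend on the model and spaCy version)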
Tokenization is breaking the raw text into small chunks. It splits the raw text into words and sentence pieces, which are called tokens. These tokens help in understanding the context and in developing models for NLP; tokenization helps in interpreting the meaning of the text by analyzing the sequence of words.
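To see these steps in action, a minimal sketch using tokenizer.explain (available from spaCy v2.2 on); the rule names in the comment are indicative:
import spacy
nlp = spacy.load('en')
for rule, token_text in nlp.tokenizer.explain(u'This is a #sentence.'):
    print(rule, token_text)
# roughly: TOKEN This / TOKEN is / TOKEN a / PREFIX # / TOKEN sentence / SUFFIX .
# i.e. the '#' is split off by a prefix rule, which is why it ends up as its own token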
I also tried several ways to prevent spaCy from splitting hashtags or hyphenated words like "cutting-edge". My experience is that merging tokens afterwards can be problematic, because the POS tagger and dependency parser have already based their decisions on the wrong tokens. Touching the infix, prefix and suffix regexes is also error-prone and complex, because you don't want your changes to produce side effects.
The simplest way is indeed, as pointed out before, to modify the token_match function of the tokenizer. This is a re.match that identifies regular expressions which should never be split. Instead of importing the specific URL pattern, I'd rather extend whatever spaCy's default is.
import re
import spacy
from spacy.tokenizer import _get_regex_pattern

nlp = spacy.load('en')

# get the default pattern for tokens that don't get split
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
# add your patterns (here: hashtags and in-word hyphens)
re_token_match = rf"({re_token_match}|#\w+|\w+-\w+)"

# overwrite the token_match function of the tokenizer
nlp.tokenizer.token_match = re.compile(re_token_match).match
text = "@Pete: choose low-carb #food #eatsmart ;-) 😋👍"
doc = nlp(text)
This yields:
['@Pete', ':', 'choose', 'low-carb', '#food', '#eatsmart', ';-)', '😋', '👍']
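As a quick sanity check (the exact tags depend on the trained model, so the output below is only indicative), the merged tokens now also carry their own part-of-speech and dependency information, since the tagger and parser never saw the split-up pieces:
print([(t.text, t.pos_) for t in doc])
# e.g. ('#food', 'NOUN') appears as one token rather than separate '#' and 'food' tokens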
This is more of an add-on to the great answer by @DhruvPathak, and a shameless copy from the GitHub thread linked below (and the even better answer there by @csvance). spaCy features (since v2.0) the add_pipe method, meaning you can wrap @DhruvPathak's great answer in a function and add that step (conveniently) into your nlp processing pipeline, as below.
Citation starts here:
def hashtag_pipe(doc):
    merged_hashtag = False
    while True:
        for token_index, token in enumerate(doc):
            if token.text == '#':
                if token.head is not None:
                    start_index = token.idx
                    end_index = start_index + len(token.head.text) + 1
                    if doc.merge(start_index, end_index) is not None:
                        merged_hashtag = True
                        break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc
nlp = spacy.load('en')
nlp.add_pipe(hashtag_pipe)
doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'
Citation ends here; see the GitHub issue "how to add hashtags to the part of speech tagger #503" for the full thread.
PS It's clear when reading the code, but for the copy&pasters, don't disable the parser :)
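Note that doc.merge and nlp.add_pipe(some_function) were removed in spaCy v3. For completeness, here is a minimal sketch of the same idea with the v3 API, using doc.retokenize and the Language.component decorator; the component name hashtag_merger and the model name are my own choices, not part of the original answer:
import spacy
from spacy.language import Language

@Language.component("hashtag_merger")
def hashtag_merger(doc):
    # collect each '#' token that is glued to the following token (no whitespace in between)
    spans = []
    for token in doc[:-1]:
        if token.text == "#" and token.whitespace_ == "":
            spans.append(doc[token.i : token.i + 2])
    # merge every such pair into a single token
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

nlp = spacy.load("en_core_web_sm")  # assumed v3 model name
nlp.add_pipe("hashtag_merger", first=True)  # run before tagger/parser so they see the merged token
doc = nlp("twitter #hashtag")
assert [t.text for t in doc] == ["twitter", "#hashtag"]
Running the component first means the downstream tagger and parser only ever see the merged hashtag token, which avoids the "wrong tokens for their decisions" problem mentioned above.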