
Built-in function to get the frequency of one word with spaCy?

Tags:

python

nlp

spacy

I'm looking for faster alternatives to NLTK for analyzing big corpora and doing basic things like calculating frequencies, PoS tagging, etc. spaCy seems great and easy to use in many ways, but I can't find a built-in function to count the frequency of a specific word. I've looked through the spaCy documentation and can't find a straightforward way to do it. Am I missing something?

What I would like is the spaCy equivalent of this NLTK-style call:

tokens.count("word")  # where tokens is the tokenized text in which the word is to be counted

In NLTK, the above code would tell me that in my text, the word "word" appears X number of times.

Note that I've come across the count_by function, but it doesn't seem to do what I'm looking for.

Michael Gauthier asked Oct 27 '25 03:10


1 Answer

I use spaCy for frequency counts in corpora quite often. This is what I usually do. Note that it matches on lemmas, so inflected forms such as "ran" are counted under "run":

import spacy
nlp = spacy.load("en_core_web_sm")

list_of_words = ['run', 'jump', 'catch']

def word_count(string):
    """Count the tokens in string whose lemma appears in list_of_words."""
    words_counted = 0
    doc = nlp(string)

    for token in doc:
        # token.text is the surface form and token.pos_ the part of speech;
        # token.lemma_ is the base form, which is what gets matched here
        if token.lemma_ in list_of_words:
            words_counted += 1
            print(token.lemma_)
    return words_counted


sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
print(words_counted)
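
For a closer match to tokens.count("word"), which counts the exact surface form rather than the lemma, something along these lines might work. This is only a sketch: the sample sentence is made up, and it assumes a blank English pipeline (tokenizer only) is enough, since no lemmas or tags are needed. The first option tallies token texts with a plain Counter; the second uses Doc.count_by, which returns a dict keyed by integer hash IDs rather than strings, which may be why it looked unhelpful at first.

```python
from collections import Counter

import spacy
from spacy.attrs import ORTH

# Assumption: tokenization alone is sufficient, so a blank pipeline is used;
# swap in spacy.load("en_core_web_sm") if lemmas or PoS tags are also needed.
nlp = spacy.blank("en")

# Hypothetical sample text, used only for illustration
doc = nlp("A word is a word, and another word is still a word.")

# Option 1: count lowercased surface forms, skipping punctuation
text_counts = Counter(token.text.lower() for token in doc if not token.is_punct)
print(text_counts["word"])

# Option 2: Doc.count_by(ORTH) maps each token's hash ID to its frequency,
# so the word has to be converted to its hash through the vocab's string store
orth_counts = doc.count_by(ORTH)
word_id = nlp.vocab.strings["word"]
print(orth_counts.get(word_id, 0))
```

The Counter version is usually the more convenient of the two because its keys are readable strings; count_by is case-sensitive when used with ORTH, so "Word" and "word" would be tallied separately there.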


Nester answered Oct 29 '25 07:10