I'm using spaCy with Python and it's working fine for tagging each word, but I was wondering whether it's possible to find the most common words in a string. Also, is it possible to get the most common nouns, verbs, adverbs and so on?
There's a count_by function included, but I can't seem to get it to run in any meaningful way.
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps, which together are referred to as the processing pipeline. The pipeline used by the trained models typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
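For instance, you can inspect which components a loaded pipeline actually runs. A minimal sketch, assuming the en_core_web_sm package is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
# component names, in the order they are applied to each Doc
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']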
While NLTK provides access to many alternative algorithms for each task, spaCy is opinionated and provides one well-tuned way to do it. It offers some of the fastest and most accurate syntactic analysis of any NLP library, and it gives access to large word vectors that are easy to customize.
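To illustrate the word-vector point, here is a hedged sketch; it assumes a pipeline that ships with vectors, such as en_core_web_md, is installed:

import spacy

nlp = spacy.load("en_core_web_md")  # the small model has no real word vectors
doc = nlp("apple banana")
print(doc[0].vector.shape)        # (300,) for en_core_web_md
print(doc[0].similarity(doc[1]))  # vector similarity between the two tokens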
Essentially, spacy.load() is a convenience wrapper that reads the pipeline's config.cfg, uses the language and pipeline information to construct a Language object, loads in the model data and weights, and returns it.
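You can see the result of that directly. Another small sketch, again assuming en_core_web_sm:

import spacy

nlp = spacy.load("en_core_web_sm")
print(type(nlp))                      # a Language subclass, e.g. spacy.lang.en.English
print(nlp.config["nlp"]["pipeline"])  # the pipeline list read from config.cfg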
I recently had to count the frequency of all the tokens in a text file. You can filter for the part-of-speech tags you want using each token's pos_ attribute. Here is a simple example:
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # the 'en' shortcut was removed in spaCy v3
doc = nlp("Your text here")

# all tokens that aren't stop words or punctuation
words = [token.text for token in doc
         if not token.is_stop and not token.is_punct]

# noun tokens that aren't stop words or punctuation
nouns = [token.text for token in doc
         if (not token.is_stop and not token.is_punct and token.pos_ == "NOUN")]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)
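As for count_by from the question: it does work, but it returns a dictionary keyed by integer hash IDs rather than strings, which is probably why the output looked meaningless. A sketch of resolving those IDs back to text, assuming the same en_core_web_sm pipeline as above:

import spacy
from spacy.attrs import ORTH, POS

nlp = spacy.load("en_core_web_sm")
doc = nlp("Your text here")

# counts of coarse part-of-speech tags, e.g. {'NOUN': 2, 'VERB': 1, ...}
pos_counts = doc.count_by(POS)
print({doc.vocab[pos_id].text: count for pos_id, count in pos_counts.items()})

# five most common surface forms, resolved through the string store
word_counts = doc.count_by(ORTH)
top_five = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)[:5]
print([(doc.vocab.strings[orth_id], count) for orth_id, count in top_five])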