 

How to find the most common words using spaCy?


I'm using spaCy with Python and it's working fine for tagging each word, but I was wondering whether it is possible to find the most common words in a string. Is it also possible to get the most common nouns, verbs, adverbs and so on?

There's a count_by function included, but I can't seem to get it to run in any meaningful way.

Harry Loyd asked May 16 '16 11:05


People also ask

What does nlp() do in spaCy?

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps, which together are referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.

Which is better NLTK or spaCy?

NLTK provides access to many algorithms for a given task, while spaCy is more opinionated and generally offers one implementation of each. spaCy is designed for fast, accurate syntactic analysis in production use. It also offers access to larger word vectors that are easier to customize.

What does spacy.load('en') do?

Essentially, spacy.load() is a convenience wrapper that reads the pipeline's config.cfg, uses the language and pipeline information to construct a Language object, loads in the model data and weights, and returns it.


1 Answer

I recently had to count the frequency of every token in a text file. You can keep only the tokens with the part of speech you want by filtering on the pos_ attribute. Here is a simple example:

    import spacy
    from collections import Counter

    # spaCy 3+ needs the full package name; older releases accepted the 'en' shortcut
    nlp = spacy.load('en_core_web_sm')
    doc = nlp('Your text here')

    # all tokens that aren't stop words or punctuation
    words = [token.text
             for token in doc
             if not token.is_stop and not token.is_punct]

    # noun tokens that aren't stop words or punctuation
    nouns = [token.text
             for token in doc
             if (not token.is_stop and
                 not token.is_punct and
                 token.pos_ == "NOUN")]

    # five most common tokens
    word_freq = Counter(words)
    common_words = word_freq.most_common(5)

    # five most common noun tokens
    noun_freq = Counter(nouns)
    common_nouns = noun_freq.most_common(5)
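The same filtering idea can be extended to cover verbs, adverbs and every other tag in one pass, rather than building a separate list per part of speech. The following is only a sketch: the helper name most_common_by_pos is hypothetical, and lowercasing the token text before counting is an assumption that may not suit every use case. It relies only on the token attributes already used above (text, pos_, is_stop, is_punct), so a spaCy Doc can be passed in directly.

```python
from collections import Counter, defaultdict


def most_common_by_pos(tokens, n=5):
    """Return {pos_tag: [(word, count), ...]} with the top-n words per tag.

    `tokens` is any iterable of objects exposing the spaCy Token attributes
    text, pos_, is_stop and is_punct (for example, a spaCy Doc).
    The function name and the lowercasing step are illustrative choices.
    """
    counts = defaultdict(Counter)
    for token in tokens:
        # Skip stop words and punctuation, as in the example above
        if token.is_stop or token.is_punct:
            continue
        # One Counter per coarse part-of-speech tag
        counts[token.pos_][token.text.lower()] += 1
    return {pos: counter.most_common(n) for pos, counter in counts.items()}
```

With a loaded pipeline, most_common_by_pos(nlp('Your text here')).get('VERB', []) would give the most frequent verbs. Counting token.lemma_ instead of token.text would merge inflected forms such as "run" and "running", if that is the behaviour you want.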
Paras Dahal answered Oct 19 '22 04:10