
Practical examples of NLTK use [closed]

Tags:

python

nlp

nltk

I'm playing around with the Natural Language Toolkit (NLTK).

Its documentation (the Book and the HOWTO) is quite bulky, and the examples are sometimes slightly advanced.

Are there any good but basic examples of uses/applications of NLTK? I'm thinking of things like the NLTK articles on the Stream Hacker blog.

Mat asked Feb 08 '09 21:02



People also ask

What is NLTK, and how is it useful for NLP text analysis?

NLTK implements the most common NLP algorithms, such as tokenization, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. It helps a program analyze, preprocess, and understand written text.

Why do we need NLTK?

The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, with a focus on statistical natural language processing (NLP). It contains text processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning.
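Tokenization, the first step listed above, can be illustrated without NLTK at all. The sketch below is a plain-Python approximation, not NLTK's own code: it applies the same regular expression that the answer further down passes to `nltk.tokenize.RegexpTokenizer`, using the standard `re` module. The names `TOKEN_PATTERN` and `tokenize` are made up for this illustration, and the sample sentence is a shortened stand-in for the answer's text.

```python
import re

# The pattern from the answer's RegexpTokenizer call: a run of word
# characters, or a run of characters that are neither word characters
# nor whitespace (that is, punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Return the non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize('Mr Blobby said "blobby", twice.[citation needed]')
print(tokens)
```

Adjacent punctuation marks are kept together as one token, which is why tokens such as `'",'` and `'.['` show up in the answer's output.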


1 Answer

Here's my own practical example for the benefit of anyone else looking this question up (excuse the sample text, it was the first thing I found on Wikipedia):

    import pprint

    import nltk

    # Needs the Brown corpus; run nltk.download('brown') once beforehand.
    tokenizer = None
    tagger = None


    def init_nltk():
        global tokenizer
        global tagger
        # Tokens are runs of word characters or runs of punctuation.
        tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
        # Tag each word with the tag it has most often in the Brown corpus.
        tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())


    def tag(text):
        global tokenizer
        global tagger
        if not tokenizer:
            init_nltk()
        tokenized = tokenizer.tokenize(text)
        tagged = tagger.tag(tokenized)
        # Sort by tag; words the tagger does not know (tag None) sort first.
        tagged.sort(key=lambda pair: pair[1] or '')
        return tagged


    def main():
        text = """Mr Blobby is a fictional character who featured on Noel
        Edmonds' Saturday night entertainment show Noel's House Party,
        which was often a ratings winner in the 1990s. Mr Blobby also
        appeared on the Jamie Rose show of 1997. He was designed as an
        outrageously over the top parody of a one-dimensional, mute novelty
        character, which ironically made him distinctive, absurd and popular.
        He was a large pink humanoid, covered with yellow spots, sporting a
        permanent toothy grin and jiggling eyes. He communicated by saying
        the word "blobby" in an electronically-altered voice, expressing
        his moods through tone of voice and repetition.

        There was a Mrs. Blobby, seen briefly in the video, and sold as a
        doll.

        However Mr Blobby actually started out as part of the 'Gotcha'
        feature during the show's second series (originally called 'Gotcha
        Oscars' until the threat of legal action from the Academy of Motion
        Picture Arts and Sciences[citation needed]), in which celebrities
        were caught out in a Candid Camera style prank. Celebrities such as
        dancer Wayne Sleep and rugby union player Will Carling would be
        enticed to take part in a fictitious children's programme based around
        their profession. Mr Blobby would clumsily take part in the activity,
        knocking over the set, causing mayhem and saying "blobby blobby
        blobby", until finally when the prank was revealed, the Blobby
        costume would be opened - revealing Noel inside. This was all the more
        surprising for the "victim" as during rehearsals Blobby would be
        played by an actor wearing only the arms and legs of the costume and
        speaking in a normal manner.[citation needed]"""
        tagged = tag(text)
        # Drop duplicate (word, tag) pairs, then sort by tag again.
        unique_tagged = list(set(tagged))
        unique_tagged.sort(key=lambda pair: pair[1] or '')
        pprint.pprint(unique_tagged)


    if __name__ == '__main__':
        main()

Output:

    [('rugby', None),
     ('Oscars', None),
     ('1990s', None),
     ('",', None),
     ('Candid', None),
     ('"', None),
     ('blobby', None),
     ('Edmonds', None),
     ('Mr', None),
     ('outrageously', None),
     ('.[', None),
     ('toothy', None),
     ('Celebrities', None),
     ('Gotcha', None),
     (']),', None),
     ('Jamie', None),
     ('humanoid', None),
     ('Blobby', None),
     ('Carling', None),
     ('enticed', None),
     ('programme', None),
     ('1997', None),
     ('s', None),
     ("'", "'"),
     ('[', '('),
     ('(', '('),
     (']', ')'),
     (',', ','),
     ('.', '.'),
     ('all', 'ABN'),
     ('the', 'AT'),
     ('an', 'AT'),
     ('a', 'AT'),
     ('be', 'BE'),
     ('were', 'BED'),
     ('was', 'BEDZ'),
     ('is', 'BEZ'),
     ('and', 'CC'),
     ('one', 'CD'),
     ('until', 'CS'),
     ('as', 'CS'),
     ('This', 'DT'),
     ('There', 'EX'),
     ('of', 'IN'),
     ('inside', 'IN'),
     ('from', 'IN'),
     ('around', 'IN'),
     ('with', 'IN'),
     ('through', 'IN'),
     ('-', 'IN'),
     ('on', 'IN'),
     ('in', 'IN'),
     ('by', 'IN'),
     ('during', 'IN'),
     ('over', 'IN'),
     ('for', 'IN'),
     ('distinctive', 'JJ'),
     ('permanent', 'JJ'),
     ('mute', 'JJ'),
     ('popular', 'JJ'),
     ('such', 'JJ'),
     ('fictional', 'JJ'),
     ('yellow', 'JJ'),
     ('pink', 'JJ'),
     ('fictitious', 'JJ'),
     ('normal', 'JJ'),
     ('dimensional', 'JJ'),
     ('legal', 'JJ'),
     ('large', 'JJ'),
     ('surprising', 'JJ'),
     ('absurd', 'JJ'),
     ('Will', 'MD'),
     ('would', 'MD'),
     ('style', 'NN'),
     ('threat', 'NN'),
     ('novelty', 'NN'),
     ('union', 'NN'),
     ('prank', 'NN'),
     ('winner', 'NN'),
     ('parody', 'NN'),
     ('player', 'NN'),
     ('actor', 'NN'),
     ('character', 'NN'),
     ('victim', 'NN'),
     ('costume', 'NN'),
     ('action', 'NN'),
     ('activity', 'NN'),
     ('dancer', 'NN'),
     ('grin', 'NN'),
     ('doll', 'NN'),
     ('top', 'NN'),
     ('mayhem', 'NN'),
     ('citation', 'NN'),
     ('part', 'NN'),
     ('repetition', 'NN'),
     ('manner', 'NN'),
     ('tone', 'NN'),
     ('Picture', 'NN'),
     ('entertainment', 'NN'),
     ('night', 'NN'),
     ('series', 'NN'),
     ('voice', 'NN'),
     ('Mrs', 'NN'),
     ('video', 'NN'),
     ('Motion', 'NN'),
     ('profession', 'NN'),
     ('feature', 'NN'),
     ('word', 'NN'),
     ('Academy', 'NN-TL'),
     ('Camera', 'NN-TL'),
     ('Party', 'NN-TL'),
     ('House', 'NN-TL'),
     ('eyes', 'NNS'),
     ('spots', 'NNS'),
     ('rehearsals', 'NNS'),
     ('ratings', 'NNS'),
     ('arms', 'NNS'),
     ('celebrities', 'NNS'),
     ('children', 'NNS'),
     ('moods', 'NNS'),
     ('legs', 'NNS'),
     ('Sciences', 'NNS-TL'),
     ('Arts', 'NNS-TL'),
     ('Wayne', 'NP'),
     ('Rose', 'NP'),
     ('Noel', 'NP'),
     ('Saturday', 'NR'),
     ('second', 'OD'),
     ('his', 'PP$'),
     ('their', 'PP$'),
     ('him', 'PPO'),
     ('He', 'PPS'),
     ('more', 'QL'),
     ('However', 'RB'),
     ('actually', 'RB'),
     ('also', 'RB'),
     ('clumsily', 'RB'),
     ('originally', 'RB'),
     ('only', 'RB'),
     ('often', 'RB'),
     ('ironically', 'RB'),
     ('briefly', 'RB'),
     ('finally', 'RB'),
     ('electronically', 'RB-HL'),
     ('out', 'RP'),
     ('to', 'TO'),
     ('show', 'VB'),
     ('Sleep', 'VB'),
     ('take', 'VB'),
     ('opened', 'VBD'),
     ('played', 'VBD'),
     ('caught', 'VBD'),
     ('appeared', 'VBD'),
     ('revealed', 'VBD'),
     ('started', 'VBD'),
     ('saying', 'VBG'),
     ('causing', 'VBG'),
     ('expressing', 'VBG'),
     ('knocking', 'VBG'),
     ('wearing', 'VBG'),
     ('speaking', 'VBG'),
     ('sporting', 'VBG'),
     ('revealing', 'VBG'),
     ('jiggling', 'VBG'),
     ('sold', 'VBN'),
     ('called', 'VBN'),
     ('made', 'VBN'),
     ('altered', 'VBN'),
     ('based', 'VBN'),
     ('designed', 'VBN'),
     ('covered', 'VBN'),
     ('communicated', 'VBN'),
     ('needed', 'VBN'),
     ('seen', 'VBN'),
     ('set', 'VBN'),
     ('featured', 'VBN'),
     ('which', 'WDT'),
     ('who', 'WPS'),
     ('when', 'WRB')]
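Many words in this output are tagged `None`: a unigram tagger only knows words it saw in its training data, so names like "Blobby" get no tag. The sketch below shows the idea in plain Python under simplifying assumptions. It is not NLTK's implementation; `train_unigram`, `tag_tokens` and the three-sentence training set are hypothetical stand-ins for `nltk.UnigramTagger` and the Brown corpus. In NLTK itself, the usual remedy is to pass a fallback tagger through the `backoff` argument, for example `nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))`.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, model, default=None):
    """Tag each token by dictionary lookup; unseen tokens get `default`."""
    return [(tok, model.get(tok, default)) for tok in tokens]


# Tiny hand-made training set (hypothetical; stands in for the Brown corpus).
train = [
    [("the", "AT"), ("show", "NN"), ("was", "BEDZ"), ("popular", "JJ")],
    [("they", "PPSS"), ("show", "VB"), ("the", "AT"), ("video", "NN")],
    [("to", "TO"), ("show", "VB"), ("a", "AT"), ("doll", "NN")],
]
model = train_unigram(train)

tokens = ["the", "show", "was", "Blobby"]
no_fallback = tag_tokens(tokens, model)
with_fallback = tag_tokens(tokens, model, default="NN")
print(no_fallback)    # 'Blobby' is unseen, so its tag is None
print(with_fallback)  # 'Blobby' falls back to the default tag 'NN'
```

The same lookup also shows why "show" comes out as `VB` even when it is used as a noun: a unigram tagger ignores context and always picks the word's most frequent tag.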
Mat answered Sep 21 '22 13:09
