Improving the extraction of human names with nltk [closed]

Tags: python, nlp, nltk

I am trying to extract human names from text.

Does anyone have a method that they would recommend?

This is what I tried (code is below): I use nltk to find every chunk labeled as a person, then build a list of the NNP tokens in that chunk. I skip persons with only one NNP, which avoids grabbing a lone surname.

I am getting decent results but was wondering if there are better ways to go about solving this problem.

Code:

    import nltk
    from nameparser.parser import HumanName

    def get_human_names(text):
        tokens = nltk.tokenize.word_tokenize(text)
        pos = nltk.pos_tag(tokens)
        sentt = nltk.ne_chunk(pos, binary=False)
        person_list = []
        # NLTK 3 exposes the chunk type as t.label(); NLTK 2 used t.node
        for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
            person = [leaf[0] for leaf in subtree.leaves()]
            if len(person) > 1:  # avoid grabbing lone surnames
                name = ' '.join(person)
                if name not in person_list:
                    person_list.append(name)
        return person_list

    text = """
    Some economists have responded positively to Bitcoin, including
    Francois R. Velde, senior economist of the Federal Reserve in Chicago
    who described it as "an elegant solution to the problem of creating a
    digital currency." In November 2013 Richard Branson announced that
    Virgin Galactic would accept Bitcoin as payment, saying that he had invested
    in Bitcoin and found it "fascinating how a whole new global currency
    has been created", encouraging others to also invest in Bitcoin. Other economists commenting on Bitcoin have been critical.
    Economist Paul Krugman has suggested that the structure of the currency
    incentivizes hoarding and that its value derives from the expectation that
    others will accept it as payment. Economist Larry Summers has expressed
    a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
    strategist for ConvergEx Group, has remarked on the effect of increasing
    use of Bitcoin and its restricted supply, noting, "When incremental
    adoption meets relatively fixed supply, it should be no surprise that
    prices go up. And that’s exactly what is happening to BTC prices."
    """

    names = get_human_names(text)
    print("LAST, FIRST")
    for name in names:
        parsed = HumanName(name)
        print(parsed.last + ', ' + parsed.first)

Output:

    LAST, FIRST
    Velde, Francois
    Branson, Richard
    Galactic, Virgin
    Krugman, Paul
    Summers, Larry
    Colas, Nick

Apart from Virgin Galactic, this is all valid output. Of course, knowing that Virgin Galactic isn't a human name in the context of this article is the hard (maybe impossible) part.
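One pragmatic workaround for cases like Virgin Galactic is to filter the candidates against a hand-maintained list of names known not to be people. The sketch below assumes such a list exists for your domain; the function name `drop_known_non_persons` and the `known_orgs` set are hypothetical, and this does nothing for organizations you have not listed.

```python
def drop_known_non_persons(names, non_person_names):
    """Return the names that are not in the exclusion set (case-insensitive)."""
    excluded = {n.casefold() for n in non_person_names}
    return [name for name in names if name.casefold() not in excluded]

# Candidates in the shape get_human_names() returns; the exclusion set is hypothetical.
candidates = ['Francois R. Velde', 'Richard Branson', 'Virgin Galactic', 'Paul Krugman']
known_orgs = {'Virgin Galactic', 'ConvergEx Group'}

print(drop_known_non_persons(candidates, known_orgs))
```

The filter runs after extraction, so it can be combined with any of the taggers discussed here without changing them.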


asked Nov 29 '13 17:11

e h

2 Answers

I agree with the suggestion that "make my code better" isn't well suited for this site, but I can point you to something worth digging into.

Take a look at the Stanford Named Entity Recognizer (NER). A binding for it is included in NLTK v2.0, but you must download some core files. Here is a script which can do all of that for you.

I wrote this script:

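    import nltk
    # Named NERTagger in the NLTK release used here; later NLTK 3.x releases call it StanfordNERTagger
    from nltk.tag.stanford import NERTagger

    st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz',
                   'stanford-ner/stanford-ner.jar')
    text = """YOUR TEXT GOES HERE"""

    # Tag one sentence at a time and print only the tokens labeled PERSON
    for sent in nltk.sent_tokenize(text):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)
        for tag in tags:
            if tag[1] == 'PERSON':
                print(tag)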

and got reasonably good output:

    ('Francois', 'PERSON')
    ('R.', 'PERSON')
    ('Velde', 'PERSON')
    ('Richard', 'PERSON')
    ('Branson', 'PERSON')
    ('Virgin', 'PERSON')
    ('Galactic', 'PERSON')
    ('Bitcoin', 'PERSON')
    ('Bitcoin', 'PERSON')
    ('Paul', 'PERSON')
    ('Krugman', 'PERSON')
    ('Larry', 'PERSON')
    ('Summers', 'PERSON')
    ('Bitcoin', 'PERSON')
    ('Nick', 'PERSON')
    ('Colas', 'PERSON')
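The output above is one tag per token, so full names still have to be stitched together. A minimal sketch of that step, assuming each sentence's complete `(token, tag)` list is kept (not only the PERSON entries), could group consecutive PERSON tokens with `itertools.groupby`. The helper name `group_person_tokens` and its `min_tokens` parameter are hypothetical; `min_tokens=2` mirrors the question's rule of skipping single-token matches, which also drops the lone Bitcoin hits.

```python
from itertools import groupby

def group_person_tokens(tagged_tokens, min_tokens=2):
    """Join runs of consecutive PERSON-tagged tokens into full names."""
    names = []
    for is_person, run in groupby(tagged_tokens, key=lambda pair: pair[1] == 'PERSON'):
        if not is_person:
            continue
        words = [token for token, _tag in run]
        # Skip runs shorter than min_tokens (e.g. a lone surname or a stray hit)
        if len(words) >= min_tokens:
            names.append(' '.join(words))
    return names

# Hand-written tags in the (token, tag) shape shown above; not real tagger output.
sample = [('Economist', 'O'), ('Paul', 'PERSON'), ('Krugman', 'PERSON'),
          ('has', 'O'), ('written', 'O'), ('about', 'O'), ('Bitcoin', 'PERSON')]

print(group_person_tokens(sample))  # ['Paul Krugman']
```

Two different people mentioned back to back with no token between them would be merged into one name by this approach, so treat it as a starting point.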

Hope this is helpful.

answered Oct 07 '22 21:10

tro

For anyone else looking, I found this article to be useful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

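    >>> import nltk
    >>> def extract_entities(text):
    ...     for sent in nltk.sent_tokenize(text):
    ...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
    ...             # Entity chunks are Tree objects; plain tokens are (word, tag) tuples.
    ...             # NLTK 3 uses chunk.label(); NLTK 2 used chunk.node.
    ...             if hasattr(chunk, 'label'):
    ...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))
    ...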

answered Oct 07 '22 19:10

Curtis Mattoon
