I want to extract all country and nationality mentions from text using NLTK. I used POS tagging and collected every token chunked with the GPE label, but the results were not satisfactory.
abstract="Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "
import nltk

sent = nltk.tokenize.wordpunct_tokenize(abstract)  # tokenize the abstract
pos_tag = nltk.pos_tag(sent)                       # POS-tag the tokens
nes = nltk.ne_chunk(pos_tag)                       # chunk into named-entity subtrees
places = []
for ne in nes:
    if type(ne) is nltk.tree.Tree:
        if ne.label() == 'GPE':
            places.append(u' '.join([i[0] for i in ne.leaves()]))
if len(places) == 0:
    places.append("N/A")
The results obtained are:
['Thyroid', 'Australian', 'Caucasian', 'Graves']
Some are nationalities but others are just nouns.
So what am I doing wrong, or is there another way to extract this kind of information?
NLTK already ships with a pre-trained named entity chunker, exposed through the ne_chunk() function in the nltk.chunk module. This function chunks a single POS-tagged sentence into a Tree. Here is an example of ne_chunk() applied to a tagged sentence from the treebank_chunk corpus.
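A minimal sketch, assuming the treebank, maxent_ne_chunker and words NLTK data packages have been downloaded:

import nltk
from nltk.corpus import treebank_chunk

# Run the pre-trained chunker on the first POS-tagged sentence of the corpus
tagged_sent = treebank_chunk.tagged_sents()[0]
tree = nltk.ne_chunk(tagged_sent)
print(tree)  # a Tree whose subtrees carry labels such as PERSON, ORGANIZATION and GPE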
GPE (geo-political entity) is one of the Tree labels produced by this pre-trained ne_chunk model.
If you want country names to be extracted, what you need is an NER tagger, not a POS tagger.
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as names of persons, organizations, and locations, expressions of time, quantities, monetary values, percentages, etc.
Check out Stanford NER tagger!
from nltk.tag.stanford import NERTagger  # renamed to StanfordNERTagger in newer NLTK versions

# Point the tagger at a trained NER model and the Stanford NER jar
st = NERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
tagging = st.tag(text.split())  # text is the input string, e.g. the abstract above
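With a standard 3-class Stanford model (e.g. english.all.3class.distsim.crf.ser.gz), and assuming tag() returns (token, tag) pairs as in current NLTK, the location mentions can then be filtered out, roughly like this:

# Keep only tokens the tagger labeled as locations
locations = [token for token, tag in tagging if tag == 'LOCATION']
print(locations)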
So after the fruitful comments, I dug deeper into different NER tools to find the one that best recognizes nationality and country mentions, and found that spaCy has a NORP entity type that extracts nationalities efficiently: https://spacy.io/docs/usage/entity-recognition
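A minimal sketch of the spaCy approach, assuming the en_core_web_sm model has been installed with python -m spacy download en_core_web_sm:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(abstract)  # the abstract string from the question
# NORP covers nationalities as well as religious and political groups
nationalities = [ent.text for ent in doc.ents if ent.label_ == "NORP"]
print(nationalities)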
Here's geograpy, which uses NLTK to perform entity extraction. It stores all places and locations in a gazetteer, then performs a lookup against that gazetteer to fetch the relevant places and locations. See the docs for more usage details:
from geograpy import extraction

# Reuse the abstract string defined in the question
e = extraction.Extractor(text=abstract)
e.find_entities()
print(e.places())