Python (NLTK) - more efficient way to extract noun phrases?

Tags: python-3.x, pandas, nlp, nltk, text-chunking

I have a machine learning task involving a large amount of text data. I want to identify and extract noun phrases from the training text so I can use them for feature construction later in the pipeline. I've extracted the type of noun phrases I wanted, but I'm fairly new to NLTK, so I broke the problem down into one list comprehension per step, as you can see below.

But my real question is: am I reinventing the wheel here? Is there a faster way to do this that I'm not seeing?

import nltk
import pandas as pd

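# Raw string so the backslashes in the path are not read as escape sequences
myData = pd.read_excel(r"\User\train_.xlsx")
texts = myData['message']

# Defining a grammar & Parser (raw string keeps the regex backslashes intact)
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)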

tokens = [nltk.word_tokenize(i) for i in texts]

tag_list = [nltk.pos_tag(w) for w in tokens]

phrases = [chunkr.parse(sublist) for sublist in tag_list]

# Tree.label is a method in NLTK 3, so it has to be called
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')] for tree in phrases]

Flatten the list of lists of lists of tuples that we've ended up with into just a list of lists of tuples:

leaves = [tupls for sublists in leaves for tupls in sublists]

Join the extracted terms into one bigram

nounphrases = [unigram[0][0] + ' ' + unigram[1][0] for unigram in leaves]
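
As a rough sketch of the flatten and join steps, assuming the nested shape described above (one list of phrases per message, each phrase a list of (token, tag) tuples): `itertools.chain.from_iterable` does the flattening, and joining every token keeps phrases longer than two words intact. The sample data is hand-written for illustration, not real tagger output.

```python
from itertools import chain

# Hand-written stand-in for the nested `leaves` structure:
# one inner list per message, one list of (token, tag) tuples per phrase.
leaves = [
    [[("bar", "NN"), ("sentence", "NN")],
     [("New", "NNP"), ("York", "NNP"), ("city", "NN")]],
    [[("Bruce", "NNP"), ("Wayne", "NNP")]],
]

# Flatten one level: list of lists of phrases -> list of phrases
phrases = list(chain.from_iterable(leaves))

# Join every token of a phrase, dropping the POS tags
nounphrases = [" ".join(token for token, _tag in phrase) for phrase in phrases]

print(nounphrases)  # ['bar sentence', 'New York city', 'Bruce Wayne']
```

Joining all tokens rather than only the first two is a design choice here; a strict bigram would cut "New York city" down to "New York".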


asked Mar 29 '18 20:03

Silent-J

2 Answers

Take a look at "Why is my NLTK function slow when processing the DataFrame?". There's no need to iterate through all the rows multiple times if you don't need the intermediate steps.

With ne_chunk and the solutions from

NLTK Named Entity recognition to a Python list and
How can I extract GPE(location) using NLTK ne_chunk?

[code]:

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

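    for subtree in chunked:
        if isinstance(subtree, Tree):
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset after every run of chunks, even if the phrase was already seen
            current_chunk = []

    # Flush a run of chunks that reaches the end of the text
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk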

df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.', 
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})

df['text'].apply(get_continuous_chunks)

[out]:

0                   [New York]
1    [Washington, Bruce Wayne]
Name: text, dtype: object

To use the custom RegexpParser:

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

# Defining a grammar & Parser (raw string keeps the regex backslashes intact)
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

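    for subtree in chunked:
        if isinstance(subtree, Tree):
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset after every run of chunks, even if the phrase was already seen
            current_chunk = []

    # Flush a run of chunks that reaches the end of the text
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk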


df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.', 
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})


df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))

[out]:

0                  [bar sentence, New York city]
1    [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
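
The grouping loop in get_continuous_chunks can also be looked at in isolation. Below is a minimal, NLTK-free sketch of that logic, not the library's own API: plain lists stand in for Tree chunks, (token, tag) tuples stand in for unchunked tokens, and the name `merge_adjacent_chunks` and the sample data are made up for illustration. It clears the buffer after every run of chunks, including runs that repeat an earlier phrase, and flushes a run that reaches the end of the input.

```python
def merge_adjacent_chunks(items):
    """Merge runs of adjacent chunks into phrases, keeping first occurrences.

    Assumption for this sketch: a chunk is a list of (token, tag) tuples
    (stand-in for an nltk.Tree), and any other item is a single
    (token, tag) tuple that sits outside every chunk.
    """
    merged = []
    current = []
    for item in items:
        if isinstance(item, list):  # stand-in for isinstance(item, Tree)
            current.append(" ".join(token for token, _tag in item))
        elif current:
            phrase = " ".join(current)
            if phrase not in merged:
                merged.append(phrase)
            current = []  # reset whether or not the phrase was new
    if current:  # flush a run that ends the input
        phrase = " ".join(current)
        if phrase not in merged:
            merged.append(phrase)
    return merged


# Hand-written stand-in for a chunker's output
parsed = [
    [("New", "NNP"), ("York", "NNP")],
    ("and", "CC"),
    [("New", "NNP"), ("York", "NNP")],   # repeated phrase
    ("versus", "IN"),
    [("Bruce", "NNP")],
    [("Wayne", "NNP")],                  # run of chunks ends the input
]

result = merge_adjacent_chunks(parsed)
print(result)  # ['New York', 'Bruce Wayne']
```

If the buffer were only cleared when a phrase is new, the repeated "New York" would stay in the buffer and get glued onto "Bruce Wayne"; without the final flush, a phrase at the very end of a text with no trailing punctuation would be dropped.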


answered Nov 15 '22 14:11

alvas

I suggest referring to this earlier thread: "Extracting all Nouns from a text file using nltk".

The answers there suggest TextBlob as the easiest way to achieve this (though not necessarily the most efficient in terms of processing), and the discussion there addresses your question.

from textblob import TextBlob
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
blob = TextBlob(txt)
print(blob.noun_phrases)

answered Nov 15 '22 15:11

Aldorath
