Quick NLTK parse into syntax tree

Tags: python, nlp, nltk

I am trying to parse several hundred sentences into their syntax trees, and I need to do it fast. The problem is that NLTK requires me to define a grammar, and I can't know the grammar in advance; I only know the text is going to be English. I tried using this statistical parser, and it works great for my purposes, but the speed could be a lot better. Is there a way to use NLTK parsing without a grammar? In this snippet I am using a processing pool to do the processing in "parallel", but the speed leaves a lot to be desired.

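import pickle
import re
from stat_parser.parser import Parser
from multiprocessing import Pool
import HTMLParser  # Python 2 module


def multy(a):
    # a is [index, text]; `parser` is the global created in the main block
    global parser
    # Split the text into sentences; fall back to the whole text if nothing matches
    lst = re.findall(r'(\S.+?[.!?])(?=\s+|$)', a[1])
    if len(lst) == 0:
        lst.append(a[1])
    try:
        # Only the first sentence is parsed
        ssd = parser.norm_parse(lst[0])
    except Exception:
        # Placeholder result when the parse fails
        ssd = ['NNP', 'nothing']
    # Append the pickled result, wrapped in markers, to a file shared by all workers
    with open('/var/www/html/internal', 'a') as f:
        f.write("[[ss")
        pickle.dump([a[0], ssd], f)
        f.write("ss]]")


if __name__ == '__main__':
    parser = Parser()
    with open('/var/www/html/interface') as f:
        data = f.read()
    data = data.split("\n")
    # One worker process per input line
    p = Pool(len(data))
    Totalis_dict = dict()
    listed = list()
    h = HTMLParser.HTMLParser()
    # Truncate the output file before the workers append to it
    with open('/var/www/html/internal', 'w') as f:
        f.write("")
    for ind, each in enumerate(data):
        # Drop non-ASCII characters, then decode HTML entities
        listed.append([str(ind), h.unescape(re.sub(r'[^\x00-\x7F]+', '', each))])
    p.map(multy, listed)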

asked Jun 23 '14 09:06

Evan

1 Answer

Parsing is a fairly computationally intensive operation. You can probably get much better performance out of a more polished parser, such as bllip. It is written in C++ and benefits from a team having worked on it over a prolonged period. There is a Python module which interacts with it.

Here's an example comparing bllip and the parser you are using:

from __future__ import print_function
from timeit import Timer

# setup stat_parser
from stat_parser import Parser
parser = Parser()

# setup bllip
from bllipparser import RerankingParser
from bllipparser.ModelFetcher import download_and_install_model
# download model (only needs to be done once)
model_dir = download_and_install_model('WSJ', '/tmp/models')
# Loading the model is slow, but only needs to be done once
rrp = RerankingParser.from_unified_model_dir(model_dir)

sentence = "In linguistics, grammar is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language."

if __name__ == '__main__':
    # Time 5 parses of the same sentence with each parser (total seconds)
    t_bllip = Timer(lambda: rrp.parse(sentence))
    t_stat = Timer(lambda: parser.parse(sentence))
    print("bllip", t_bllip.timeit(number=5))
    print("stat", t_stat.timeit(number=5))

And bllip runs about 9 times faster on my computer (timeit reports the total time in seconds for the 5 parses):

(vs)[jonathan@ ~]$ python /tmp/test.py 
bllip 2.57274985313
stat 22.748554945
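
To put those totals in per-sentence terms, here is a small worked calculation. It only uses the two numbers printed above and the `number=5` argument from the script; the variable names are just for illustration.

```python
# Totals printed by the script: seconds for 5 parses of the same sentence.
RUNS = 5
totals = {'bllip': 2.57274985313, 'stat': 22.748554945}

# Average time for a single parse with each parser.
per_parse = {name: total / RUNS for name, total in totals.items()}

# How many times faster bllip was on this one sentence.
speedup = totals['stat'] / totals['bllip']

print('seconds per parse: bllip %.2f, stat %.2f'
      % (per_parse['bllip'], per_parse['stat']))
print('bllip is %.1fx faster' % speedup)
```

That works out to roughly 0.5 s versus 4.5 s per sentence, a factor of about 8.8. It is a single sentence on a single machine, so treat it as a rough indication rather than a benchmark.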

Also, there's a pull request pending on integrating the bllip parser into NLTK: https://github.com/nltk/nltk/pull/605
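
Separately from the choice of parser, the snippet in the question creates its pool with `Pool(len(data))`, which is one process per input line, and every worker appends to the same file. Below is a minimal sketch of a more conventional layout: one worker per CPU core, one parser instance per worker, and results returned to the parent process. The names `init_worker`, `parse_item` and `parse_all` are made up for this sketch, and `WhitespaceParser` is a hypothetical stand-in so the code runs without any parser installed; I am assuming the real parser exposes a `parse` method as in the timing script.

```python
import re
from multiprocessing import Pool, cpu_count

# Same sentence-splitting regex as in the question.
SENTENCE_RE = re.compile(r'(\S.+?[.!?])(?=\s+|$)')

_parser = None  # set once per worker process by init_worker


class WhitespaceParser(object):
    """Hypothetical stand-in so the sketch runs without stat_parser or bllip."""

    def parse(self, sentence):
        return sentence.split()


def init_worker(parser_factory):
    # Runs once in each worker: build the (expensive) parser a single time.
    global _parser
    _parser = parser_factory()


def first_sentence(text):
    # Return the first sentence, or the whole text if the regex finds none.
    found = SENTENCE_RE.findall(text)
    return found[0] if found else text


def parse_item(item):
    # item is (index, text); a failed parse is reported as None.
    index, text = item
    try:
        return index, _parser.parse(first_sentence(text))
    except Exception:
        return index, None


def parse_all(texts, parser_factory=WhitespaceParser, processes=None):
    # One process per CPU core by default, not one per input line.
    pool = Pool(processes or cpu_count(), init_worker, (parser_factory,))
    try:
        return pool.map(parse_item, list(enumerate(texts)))
    finally:
        pool.close()
        pool.join()
```

Calling `parse_all(lines)` from under an `if __name__ == '__main__':` guard returns a list of `(index, parse)` pairs that the parent can write out in one go. This will not make the parser itself any faster; it only avoids starting far more processes than there are cores.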

Also, you state in your question that you can't know the grammar and only know the text is going to be English. If by this you mean it needs to parse other languages as well, it will be much more complicated. These statistical parsers are trained on some input, often parsed content from the WSJ in the Penn Treebank. Some parsers provide trained models for other languages as well, but you'll need to identify the language first and load an appropriate model into the parser.
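
If you do end up with mixed-language input, the "identify the language, then load the matching model" step could look roughly like the sketch below. Everything in it is an assumption for illustration: `detect` stands for whatever language identifier you choose, `model_dirs` maps language codes to model directories you have obtained yourself, and `load` turns a model directory into a callable that parses one sentence. Grouping by language first means each model is loaded once and only one model is needed at a time.

```python
from collections import defaultdict


def group_by_language(texts, detect):
    """Bucket (index, text) pairs by the language code that `detect` returns."""
    groups = defaultdict(list)
    for index, text in enumerate(texts):
        groups[detect(text)].append((index, text))
    return dict(groups)


def parse_by_language(texts, detect, model_dirs, load):
    """Parse each text with the model for its language, one model at a time."""
    results = [None] * len(texts)
    for lang, items in group_by_language(texts, detect).items():
        if lang not in model_dirs:
            continue  # no trained model for this language: result stays None
        parse = load(model_dirs[lang])  # loaded once per language
        for index, text in items:
            results[index] = parse(text)
    return results
```

With bllip, `load` could be a small wrapper around `RerankingParser.from_unified_model_dir` from the script above that returns the resulting object's `parse` method, but I have not tried that with anything other than the WSJ model.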

answered Sep 29 '22 22:09

Jonathan Villemaire-Krajden
