I am trying to parse several hundreds of sentences into their syntax trees and i need to do that fast, the problem is that if i use NLTK then i need to define a grammar, and i cant know that i only know its gonna be english. I tried using this statistical parser, and it works great for my purposes however the speed could be a lot better, is there a way to use nltk parsing without a grammar? In this snippet i am using a processing pool to do the processing in "parallel" but the speed leaves a lot to be desired.
import pickle
import re
from stat_parser.parser import Parser
from multiprocessing import Pool
import HTMLParser
def multy(a):
global parser
lst=re.findall('(\S.+?[.!?])(?=\s+|$)',a[1])
if len(lst)==0:
lst.append(a[1])
try:
ssd=parser.norm_parse(lst[0])
except:
ssd=['NNP','nothing']
with open('/var/www/html/internal','a') as f:
f.write("[[ss")
pickle.dump([a[0],ssd], f)
f.write("ss]]")
if __name__ == '__main__':
parser=Parser()
with open('/var/www/html/interface') as f:
data=f.read()
data=data.split("\n")
p = Pool(len(data))
Totalis_dict=dict()
listed=list()
h = HTMLParser.HTMLParser()
with open('/var/www/html/internal','w') as f:
f.write("")
for ind,each in enumerate(data):
listed.append([str(ind),h.unescape(re.sub('[^\x00-\x7F]+','',each))])
p.map(multy,listed)
A parse tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. A syntax tree, on the other hand, is a tree representation of the abstract syntactic structure of source code written in a programming language.
What is Parse Tree? In general, a parse tree (also called a derivation tree or concrete syntax tree) is a hierarchical structure that represents the syntactic structure according to a given grammar.
A parse tree showing the values of attributes at each node is called an Annotated parse tree. The process of computing the attributes values at the nodes is called annotating (or decorating) of the parse tree.
Parsing is a fairly computationally intensive operation. You can probably get much better performance out of a more polished parser, such as bllip. It is written in c++ and benefits from a team having worked on it over a prolonged period. There is a python module which interacts with it.
Here's an example comparing bllip and the parser you are using:
import timeit
# setup stat_parser
from stat_parser import Parser
parser = Parser()
# setup bllip
from bllipparser import RerankingParser
from bllipparser.ModelFetcher import download_and_install_model
# download model (only needs to be done once)
model_dir = download_and_install_model('WSJ', '/tmp/models')
# Loading the model is slow, but only needs to be done once
rrp = RerankingParser.from_unified_model_dir(model_dir)
sentence = "In linguistics, grammar is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language."
if __name__=='__main__':
from timeit import Timer
t_bllip = Timer(lambda: rrp.parse(sentence))
t_stat = Timer(lambda: parser.parse(sentence))
print "bllip", t_bllip.timeit(number=5)
print "stat", t_stat.timeit(number=5)
And it runs about 10 times faster on my computer:
(vs)[jonathan@ ~]$ python /tmp/test.py
bllip 2.57274985313
stat 22.748554945
Also, there's a pull request pending on integrating the bllip parser into NLTK: https://github.com/nltk/nltk/pull/605
Also, you state: "i cant know that i only know its gonna be english" in your question. If by this you mean it needs to parse other languages as well, it will be much more complicated. These statistical parsers are trained on some input, often parsed content from the WSJ in the Penn TreeBanks. Some parses will provide trained models for other languages as well, but you'll need to identify the language first, and load an appropriate model into the parser.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With