
How to use Stanford Parser in NLTK using Python

Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)

ThanaDaray asked Dec 14 '12 17:12



15 Answers

Note that this answer applies to NLTK v 3.0, and not to more recent versions.

Sure, try the following in Python:

import os
from nltk.parse import stanford

# Both variables point at the folder that holds the parser and models jars.
os.environ['STANFORD_PARSER'] = '/path/to/stanford/jars'
os.environ['STANFORD_MODELS'] = '/path/to/stanford/jars'

parser = stanford.StanfordParser(model_path="/location/of/the/englishPCFG.ser.gz")
sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))
print(sentences)

# GUI
for line in sentences:
    for sentence in line:
        sentence.draw()

Output:

[Tree('ROOT', [Tree('S', [Tree('INTJ', [Tree('UH', ['Hello'])]), Tree(',', [',']), Tree('NP', [Tree('PRP$', ['My']), Tree('NN', ['name'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('JJ', ['Melroy'])])]), Tree('.', ['.'])])]), Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('PRP$', ['your']), Tree('NN', ['name'])])]), Tree('.', ['?'])])])]

Note 1: In this example both the parser & model jars are in the same folder.

Note 2:

  • File name of stanford parser is: stanford-parser.jar
  • File name of stanford models is: stanford-parser-x.x.x-models.jar

Note 3: The englishPCFG.ser.gz file can be found inside the models.jar file (/edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz). Use an archive manager to 'unzip' the models.jar file.

Note 4: Be sure you are using Java 8 (JRE 1.8, as shipped with Oracle JDK 8). With an older Java you will get: Unsupported major.minor version 52.0.
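
The 52.0 in that error is the class-file version the jar was compiled for. As a quick sanity check, class-file major versions map to Java releases by subtracting 44 (52 is Java 8, 51 is Java 7). The helper name below is made up for illustration and is not part of NLTK or the Stanford tools:

```python
def java_release_for_class_version(major):
    """Map a class-file major version (e.g. 52) to the Java release that introduced it."""
    # Java 1.1 used major version 45, and each release since has added one.
    return major - 44


# "Unsupported major.minor version 52.0" means the jar needs at least Java 8.
required = java_release_for_class_version(52)
print("The parser jar needs Java %d or newer" % required)
```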

Installation

  1. Download NLTK v3 from https://github.com/nltk/nltk and install it:

    sudo python setup.py install

  2. You can use the NLTK downloader to get Stanford Parser, using Python:

    import nltk
    nltk.download()
    
  3. Try my example! (Don't forget to change the jar paths and to change the model path to the ser.gz location.)

OR:

  1. Download and install NLTK v3, same as above.

  2. Download the latest version from (current version filename is stanford-parser-full-2015-01-29.zip): http://nlp.stanford.edu/software/lex-parser.shtml#Download

  3. Extract the stanford-parser-full-20xx-xx-xx.zip.

  4. Create a new folder ('jars' in my example). Place the extracted files into this jars folder: stanford-parser-3.x.x-models.jar and stanford-parser.jar.

    As shown above you can use the environment variables (STANFORD_PARSER & STANFORD_MODELS) to point to this 'jars' folder. I'm using Linux, so if you use Windows please use something like: C://folder//jars.

  5. Open the stanford-parser-3.x.x-models.jar using an Archive manager (7zip).

  6. Browse inside the jar file; edu/stanford/nlp/models/lexparser. Again, extract the file called 'englishPCFG.ser.gz'. Remember the location where you extract this ser.gz file.

  7. When creating a StanfordParser instance, you can provide the model path as parameter. This is the complete path to the model, in our case /location/of/englishPCFG.ser.gz.

  8. Try my example! (Don't forget to change the jar paths and to change the model path to the ser.gz location.)
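
Steps 5 and 6 can also be done from Python instead of an archive manager, since a .jar file is an ordinary zip archive. This is only a sketch: the function name and the placeholder paths are my own, and only the member path comes from the notes above.

```python
import os
import zipfile

# Path of the grammar inside the models jar, as given in Note 3 / step 6.
MODEL_MEMBER = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"


def extract_english_pcfg(models_jar, dest_dir):
    """Extract englishPCFG.ser.gz from the models jar and return its new path."""
    # A .jar file is a zip archive, so the zipfile module can read it.
    with zipfile.ZipFile(models_jar) as jar:
        return jar.extract(MODEL_MEMBER, dest_dir)


if __name__ == "__main__":
    # Placeholder path: adjust to your own jars folder and version.
    jar_path = "/path/to/stanford/jars/stanford-parser-3.x.x-models.jar"
    if os.path.exists(jar_path):
        print(extract_english_pcfg(jar_path, "/location/of/the"))
```

Note that extract() keeps the directory structure of the archive, so the file ends up under dest_dir/edu/stanford/nlp/models/lexparser/. Pass the returned path as model_path.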

Melroy van den Berg answered Oct 03 '22 15:10


Deprecated Answer

The answer below is deprecated, please use the solution on https://stackoverflow.com/a/51981566/610569 for NLTK v3.3 and above.


EDITED

Note: The following answer will only work on:

  • NLTK version ==3.2.5
  • Stanford Tools compiled since 2016-10-31
  • Python 2.7, 3.5 and 3.6

Both tools change rather quickly, and the API might look very different 3-6 months later. Please treat the following answer as temporary, not as a permanent fix.

Always refer to https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software for the latest instructions on how to interface Stanford NLP tools using NLTK!!

TL;DR

The following code comes from https://github.com/nltk/nltk/pull/1735#issuecomment-306091826

In terminal:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000

In Python:

>>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
>>> from nltk.parse.corenlp import CoreNLPParser

>>> stpos, stner = CoreNLPPOSTagger(), CoreNLPNERTagger()

>>> stpos.tag('What is the airspeed of an unladen swallow ?'.split())
[(u'What', u'WP'), (u'is', u'VBZ'), (u'the', u'DT'), (u'airspeed', u'NN'), (u'of', u'IN'), (u'an', u'DT'), (u'unladen', u'JJ'), (u'swallow', u'VB'), (u'?', u'.')]

>>> stner.tag('Rami Eid is studying at Stony Brook University in NY'.split())
[(u'Rami', u'PERSON'), (u'Eid', u'PERSON'), (u'is', u'O'), (u'studying', u'O'), (u'at', u'O'), (u'Stony', u'ORGANIZATION'), (u'Brook', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'in', u'O'), (u'NY', u'O')]


>>> parser = CoreNLPParser(url='http://localhost:9000')

>>> next(
...     parser.raw_parse('The quick brown fox jumps over the lazy dog.')
... ).pretty_print()  # doctest: +NORMALIZE_WHITESPACE
                     ROOT
                      |
                      S
       _______________|__________________________
      |                         VP               |
      |                _________|___             |
      |               |             PP           |
      |               |     ________|___         |
      NP              |    |            NP       |
  ____|__________     |    |     _______|____    |
 DT   JJ    JJ   NN  VBZ   IN   DT      JJ   NN  .
 |    |     |    |    |    |    |       |    |   |
The quick brown fox jumps over the     lazy dog  .

>>> (parse_fox, ), (parse_wolf, ) = parser.raw_parse_sents(
...     [
...         'The quick brown fox jumps over the lazy dog.',
...         'The quick grey wolf jumps over the lazy fox.',
...     ]
... )

>>> parse_fox.pretty_print()  # doctest: +NORMALIZE_WHITESPACE
                     ROOT
                      |
                      S
       _______________|__________________________
      |                         VP               |
      |                _________|___             |
      |               |             PP           |
      |               |     ________|___         |
      NP              |    |            NP       |
  ____|__________     |    |     _______|____    |
 DT   JJ    JJ   NN  VBZ   IN   DT      JJ   NN  .
 |    |     |    |    |    |    |       |    |   |
The quick brown fox jumps over the     lazy dog  .

>>> parse_wolf.pretty_print()  # doctest: +NORMALIZE_WHITESPACE
                     ROOT
                      |
                      S
       _______________|__________________________
      |                         VP               |
      |                _________|___             |
      |               |             PP           |
      |               |     ________|___         |
      NP              |    |            NP       |
  ____|_________      |    |     _______|____    |
 DT   JJ   JJ   NN   VBZ   IN   DT      JJ   NN  .
 |    |    |    |     |    |    |       |    |   |
The quick grey wolf jumps over the     lazy fox  .

>>> (parse_dog, ), (parse_friends, ) = parser.parse_sents(
...     [
...         "I 'm a dog".split(),
...         "This is my friends ' cat ( the tabby )".split(),
...     ]
... )

>>> parse_dog.pretty_print()  # doctest: +NORMALIZE_WHITESPACE
        ROOT
         |
         S
  _______|____
 |            VP
 |    ________|___
 NP  |            NP
 |   |         ___|___
PRP VBP       DT      NN
 |   |        |       |
 I   'm       a      dog

Please take a look at http://www.nltk.org/_modules/nltk/parse/corenlp.html for more information on the Stanford API. Take a look at the docstrings!
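
For a rough idea of what talking to the server started in the terminal step involves, here is a Python 3 sketch that sends text to it over HTTP using only the standard library. It is not NLTK's implementation. The function names are made up, and the `properties` query parameter and the `parse` key in the JSON response are my assumptions about the CoreNLP server interface, so check them against the server documentation for your version.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_request_url(server_url, annotators):
    """Build the URL for one annotation request to a CoreNLP server."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    return server_url.rstrip("/") + "/?properties=" + quote(json.dumps(properties))


def fetch_parses(text, server_url="http://localhost:9000"):
    """Send text to the server and return one bracketed parse string per sentence."""
    url = build_request_url(server_url, "tokenize,ssplit,pos,parse")
    request = Request(url, data=text.encode("utf-8"))  # data= makes this a POST
    with urlopen(request, timeout=15) as response:
        result = json.loads(response.read().decode("utf-8"))
    # Assumption: each sentence object carries its tree under the "parse" key.
    return [sentence["parse"] for sentence in result["sentences"]]
```

With the server running, fetch_parses('The quick brown fox jumps over the lazy dog.') should return a list with one bracketed tree string, which nltk.Tree.fromstring can turn into a Tree.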

alvas answered Sep 30 '22 15:09


There is a Python interface for the Stanford parser:

http://projects.csail.mit.edu/spatial/Stanford_Parser

Rohith answered Oct 04 '22 15:10


The Stanford Core NLP software page has a list of python wrappers:

http://nlp.stanford.edu/software/corenlp.shtml#Extensions

silverasm answered Oct 03 '22 15:10


If I remember correctly, the Stanford parser is a Java library, so you must have a Java runtime installed on your server/computer.

I used it once on a server, combined with a PHP script. The script used PHP's exec() function to make a command-line call to the parser like so:

<?php

exec( "java -cp /pathTo/stanford-parser.jar -mx100m edu.stanford.nlp.process.DocumentPreprocessor /pathTo/fileToParse > /pathTo/resultFile 2>/dev/null" );

?>

I don't remember all the details of this command; it basically opened the fileToParse, parsed it, and wrote the output to the resultFile. PHP would then open the result file for further use.

The end of the command redirects the parser's verbose output (stderr) to /dev/null, to prevent unnecessary command-line information from disturbing the script.

I don't know much about Python, but there might be a way to make command line calls.

It might not be the exact route you were hoping for, but hopefully it'll give you some inspiration. Best of luck.
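
For what it's worth, Python can make the same kind of command-line call with its standard subprocess module. Below is a Python 3 sketch of the PHP call above. The two function names are hypothetical, and the jar and file paths are the same placeholders as in the PHP example:

```python
import subprocess


def build_java_command(jar_path, input_path):
    """Return the argument list for the java call used in the PHP example."""
    return [
        "java",
        "-cp", jar_path,
        "-mx100m",
        "edu.stanford.nlp.process.DocumentPreprocessor",
        input_path,
    ]


def run_java(jar_path, input_path, result_path):
    """Run the command, writing stdout to result_path and discarding stderr."""
    with open(result_path, "wb") as result_file:
        return subprocess.call(
            build_java_command(jar_path, input_path),
            stdout=result_file,
            stderr=subprocess.DEVNULL,  # same effect as 2>/dev/null
        )
```

Passing the arguments as a list avoids shell quoting problems. run_java('/pathTo/stanford-parser.jar', '/pathTo/fileToParse', '/pathTo/resultFile') returns the exit code of the java process.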

bob dope answered Sep 30 '22 15:09


Note that this answer applies to NLTK v 3.0, and not to more recent versions.

Here is an adaptation of danger89's code that works with NLTK 3.0.0 on Windows, and presumably on the other platforms as well. Adjust directory names as appropriate for your setup:

import os
from nltk.parse import stanford
os.environ['STANFORD_PARSER'] = 'd:/stanford-parser'
os.environ['STANFORD_MODELS'] = 'd:/stanford-parser'
os.environ['JAVAHOME'] = 'c:/Program Files/java/jre7/bin'

parser = stanford.StanfordParser(model_path="d:/stanford-grammars/englishPCFG.ser.gz")
sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))
print(sentences)

Note that the parsing command has changed (see the source code at www.nltk.org/_modules/nltk/parse/stanford.html), and that you need to define the JAVAHOME variable. I tried to get it to read the grammar file in situ in the jar, but have so far failed to do that.
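
If you would rather not hard-code the Java location, a small Python 3 sketch like the following can look it up on PATH. The helper name is made up, and it assumes NLTK accepts the full path of the java executable in JAVAHOME (another answer on this page sets it that way), so treat it as a starting point only:

```python
import os
import shutil


def find_java():
    """Return the path of the java executable found on PATH, or None."""
    return shutil.which("java")


java_path = find_java()
if java_path is not None:
    # Only fill in JAVAHOME if the user has not already set it.
    os.environ.setdefault("JAVAHOME", java_path)
print("java executable:", java_path)
```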

Avery Andrews answered Oct 02 '22 15:10


You can use the Stanford Parser's output to create a Tree in NLTK (nltk.tree.Tree).

Assuming the Stanford parser gives you a file in which there is exactly one parse tree for every sentence, this example works, though it might not look very pythonic:

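import sys
import nltk

# Collect the text of each parse tree; trees are separated by blank lines.
parse_trees_text = []
tree = ""
with open(sys.argv[1] + ".output" + ".30" + ".stp", "r") as f:
    for line in f:
        if line.isspace():
            parse_trees_text.append(tree)
            tree = ""
        elif "(. ...))" in line:
            # Sentence-final punctuation line: close the tree and store it.
            tree = tree + ')'
            parse_trees_text.append(tree)
            tree = ""
        else:
            tree = tree + line

parse_trees = []
for t in parse_trees_text:
    if not t.strip():
        continue  # skip empty entries left by consecutive blank lines
    tree = nltk.Tree.fromstring(t)  # in NLTK 2 this was nltk.Tree(t)
    del tree[len(tree) - 1]  # delete "(. .))" from tree (you don't need that)
    # s = traverse(tree)  # traverse() is not defined in this answer
    parse_trees.append(tree)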

Sadik answered Oct 02 '22 15:10


Note that this answer applies to NLTK v 3.0, and not to more recent versions.

Since nobody has really mentioned it, and it troubled me a lot, here is an alternative way to use the Stanford parser in Python:

from nltk.parse.stanford import StanfordParser

stanford_parser_jar = '../lib/stanford-parser-full-2015-04-20/stanford-parser.jar'
stanford_model_jar = '../lib/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar'
parser = StanfordParser(path_to_jar=stanford_parser_jar,
                        path_to_models_jar=stanford_model_jar)

This way, you don't need to worry about the path settings anymore. This is for those who cannot get it to work properly on Ubuntu or when running the code in Eclipse.

Zhong Zhu answered Sep 30 '22 15:09


I am on a Windows machine, and you can simply run the parser as you normally would from the command line, but from a different directory, so you don't need to edit the lexparser.bat file. Just put in the full path.

import os

cmd = r'java -cp \Documents\stanford_nlp\stanford-parser-full-2015-01-30 edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "typedDependencies" \Documents\stanford_nlp\stanford-parser-full-2015-01-30\stanford-parser-3.5.1-models\edu\stanford\nlp\models\lexparser\englishFactored.ser.gz stanfordtemp.txt'
parse_out = os.popen(cmd).readlines()

The tricky part for me was realizing how to run a java program from a different path. There must be a better way but this works.
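
Once parse_out holds the typedDependencies lines, they can be split into tuples with a regular expression. This is a sketch under the assumption that each line looks like relation(head-index, dependent-index). The function name and the sample lines are made up for illustration and are not real parser output:

```python
import re

# Matches lines such as "nsubj(jumps-5, fox-4)".
DEPENDENCY_LINE = re.compile(r"^(\S+?)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependency(line):
    """Split one line into (relation, head, head_index, dependent, dependent_index)."""
    match = DEPENDENCY_LINE.match(line.strip())
    if match is None:
        return None  # blank separator lines and anything unexpected
    relation, head, head_index, dependent, dependent_index = match.groups()
    return relation, head, int(head_index), dependent, int(dependent_index)


# Illustrative input in the same shape as parse_out (a list of lines).
sample = ["nsubj(jumps-5, fox-4)\n", "root(ROOT-0, jumps-5)\n", "\n"]
dependencies = [d for d in map(parse_dependency, sample) if d is not None]
print(dependencies)
```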

Ted Petrou answered Oct 03 '22 15:10


Note that this answer applies to NLTK v 3.0, and not to more recent versions.

A slight update (or simply an alternative) to danger89's comprehensive answer on using Stanford Parser in NLTK and Python.

With stanford-parser-full-2015-04-20, JRE 1.8 and nltk 3.0.4 (python 2.7.6), it appears that you no longer need to extract englishPCFG.ser.gz from stanford-parser-x.x.x-models.jar or set any os.environ variables:

from nltk.parse.stanford import StanfordParser

english_parser = StanfordParser('path/stanford-parser.jar', 'path/stanford-parser-3.5.2-models.jar')

s = "The real voyage of discovery consists not in seeking new landscapes, but in having new eyes."

sentences = english_parser.raw_parse_sents((s,))
print(sentences)  # only prints <listiterator object> for this version

#draw the tree
for line in sentences:
    for sentence in line:
        sentence.draw()

SYK answered Oct 04 '22 15:10


Note that this answer applies to NLTK v 3.0, and not to more recent versions.

Here is the Windows version of alvas's answer:

import os
from nltk.tree import ParentedTree

sentences = ('. '.join(['this is sentence one without a period', 'this is another foo bar sentence ']) + '.').encode('ascii', errors='ignore')
catpath = r"YOUR CURRENT FILE PATH"

# Write the encoded sentences to a temporary file for lexparser.bat to read.
with open('stanfordtemp.txt', 'wb') as f:
    f.write(sentences)

parse_out = os.popen(catpath + r"\nlp_tools\stanford-parser-2010-08-20\lexparser.bat " + catpath + r"\stanfordtemp.txt").readlines()

# Keep only the bracketed lines, then split them into one string per (ROOT ...) tree.
bracketed_parse = " ".join([i.strip() for i in parse_out if i.strip() and i.strip()[0] == "("])
bracketed_parse = "\n(ROOT".join(bracketed_parse.split(" (ROOT")).split('\n')
aa = [ParentedTree.fromstring(x) for x in bracketed_parse]

NOTES:

  • In lexparser.bat you need to change all the paths into absolute paths to avoid Java errors such as "class not found".

  • I strongly recommend this method on Windows, since I tried several answers on this page and every method that has Python communicate with Java failed.

  • I would like to hear from you if you succeed on Windows, and how you overcame these problems.

  • Search for a Python wrapper for Stanford CoreNLP to get the Python version.


redreamality answered Oct 01 '22 15:10


It took me many hours, but I finally found a simple solution for Windows users. It is basically a summarized version of an existing answer by alvas, made (hopefully) easy to follow for those who are new to Stanford NLP and use Windows.

1) Download the module you want to use, such as NER, POS etc. In my case I wanted to use NER, so I downloaded the module from http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip

2) Unzip the file.

3) Set the environment variables (CLASSPATH and STANFORD_MODELS) from the unzipped folder.

import os
os.environ['CLASSPATH'] = "C:/Users/Downloads/stanford-ner-2015-04-20/stanford-ner.jar"
os.environ['STANFORD_MODELS'] = "C:/Users/Downloads/stanford-ner-2015-04-20/classifiers/"

4) Set the environment variable for Java, pointing to where you have Java installed. For me it was:

os.environ['JAVAHOME'] = "C:/Program Files/Java/jdk1.8.0_102/bin/java.exe"

5) Import the module you want.

from nltk.tag import StanfordNERTagger

6) Load the pretrained model, which is in the classifiers folder of the unzipped folder. Add ".gz" at the end of the file name. For me, the model I wanted to use was english.all.3class.distsim.crf.ser:

st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')

7) Now run the tagger, and we are done!

st.tag('Rami Eid is studying at Stony Brook University in NY'.split())

StatguyUser answered Oct 03 '22 15:10


Note that this answer applies to NLTK v 3.0, and not to more recent versions.

I cannot leave this as a comment because of reputation, but since I spent (wasted?) some time solving this I would rather share my problem/solution to get this parser to work in NLTK.

In the excellent answer from alvas, it is mentioned that:

e.g. for the Parser, there won't be a model directory.

This led me wrongly to:

  • not being careful about the value I gave STANFORD_MODELS (and only caring about my CLASSPATH)
  • leaving the ../path/to/stanford-parser-full-2015-12-09/models directory *virtually empty* (or with a jar file whose name did not match the NLTK regex)!

If the OP, like me, just wanted to use the parser, it may be confusing that, without downloading anything else (no POS tagger, no NER, ...) and after following all these instructions, we still get an error.

Eventually, for any CLASSPATH given (following examples and explanations in answers from this thread) I would still get the error:

NLTK was unable to find stanford-parser-(\d+)(.(\d+))+-models.jar! Set the CLASSPATH environment variable. For more information, on stanford-parser-(\d+)(.(\d+))+-models.jar,

see: http://nlp.stanford.edu/software/lex-parser.shtml

OR:

NLTK was unable to find stanford-parser.jar! Set the CLASSPATH environment variable. For more information, on stanford-parser.jar, see: http://nlp.stanford.edu/software/lex-parser.shtml

Though, importantly, I could correctly load and use the parser if I called the function with all arguments and path fully specified, as in:

from nltk.parse.stanford import StanfordParser

stanford_parser_jar = '../lib/stanford-parser-full-2015-04-20/stanford-parser.jar'
stanford_model_jar = '../lib/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar'
parser = StanfordParser(path_to_jar=stanford_parser_jar,
                        path_to_models_jar=stanford_model_jar)

Solution for Parser alone:

Therefore the error came from NLTK and how it is looking for jars using the supplied STANFORD_MODELS and CLASSPATH environment variables. To solve this, the *-models.jar, with the correct formatting (to match the regex in NLTK code, so no -corenlp-....jar) must be located in the folder designated by STANFORD_MODELS.

Namely, I first created:

mkdir stanford-parser-full-2015-12-09/models

Then added in .bashrc:

export STANFORD_MODELS=/path/to/stanford-parser-full-2015-12-09/models

And finally, by copying stanford-parser-3.6.0-models.jar (or corresponding version), into:

path/to/stanford-parser-full-2015-12-09/models/

I could get StanfordParser to load smoothly in python with the classic CLASSPATH that points to stanford-parser.jar. Actually, as such, you can call StanfordParser with no parameters, the default will just work.
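
To check up front whether a folder contains a jar with a name NLTK would accept, the pattern from the error message can be tested directly. This is a sketch, not NLTK's own lookup code: the function name is hypothetical, and the regex is the one quoted in the error above with the dots escaped.

```python
import os
import re

# Pattern copied from the NLTK error message, with the dots escaped.
MODELS_JAR = re.compile(r"stanford-parser-(\d+)(\.(\d+))+-models\.jar$")


def find_models_jars(directory):
    """Return the file names in directory that look like a parser models jar."""
    return sorted(name for name in os.listdir(directory) if MODELS_JAR.match(name))


if __name__ == "__main__":
    models_dir = os.environ.get("STANFORD_MODELS", "")
    if os.path.isdir(models_dir):
        print(find_models_jars(models_dir))
    else:
        print("STANFORD_MODELS is not set to a directory")
```

An empty list for your STANFORD_MODELS folder would explain the "unable to find" error described above.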

H. Rev. answered Oct 02 '22 15:10


I am using NLTK version 3.2.4, and the following code worked for me.

from nltk.internals import find_jars_within_path
from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize

# As an alternative to setting the CLASSPATH, add the jar and model via their path:
jar = '/home/ubuntu/stanford-postagger-full-2017-06-09/stanford-postagger.jar'
model = '/home/ubuntu/stanford-postagger-full-2017-06-09/models/english-left3words-distsim.tagger'

pos_tagger = StanfordPOSTagger(model, jar)

# Add other jars from Stanford directory
stanford_dir = pos_tagger._stanford_jar.rpartition('/')[0]
stanford_jars = find_jars_within_path(stanford_dir)
pos_tagger._stanford_jar = ':'.join(stanford_jars)

text = pos_tagger.tag(word_tokenize("Open app and play movie"))
print(text)

Output:

[('Open', 'VB'), ('app', 'NN'), ('and', 'CC'), ('play', 'VB'), ('movie', 'NN')]
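
One caveat: joining the jars with ':' only works on Linux and macOS, because Windows separates classpath entries with ';'. A portable way to build such a string, sketched with the standard library only, is shown below. The function is a simplified, non-recursive stand-in that I made up; it is not NLTK's find_jars_within_path.

```python
import glob
import os


def build_classpath(stanford_dir):
    """Join every .jar directly inside stanford_dir into one classpath string."""
    jars = sorted(glob.glob(os.path.join(stanford_dir, "*.jar")))
    # os.pathsep is ':' on Linux/macOS and ';' on Windows.
    return os.pathsep.join(jars)
```

In the code above, the result of build_classpath(stanford_dir) could be assigned to pos_tagger._stanford_jar in place of the ':'.join(...) line.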

Aditi answered Oct 04 '22 15:10


A new version of the Stanford parser, based on a neural model trained using Tensorflow, has very recently been made available as a Python API. This model is supposed to be far more accurate than the Java-based model. You can certainly integrate it with an NLTK pipeline.

Link to the parser. The repository contains pre-trained parser models for 53 languages.

0x5050 answered Oct 04 '22 15:10