Stanford typed dependencies using coreNLP in python

Tags:

In Stanford Dependency Manual they mention "Stanford typed dependencies" and particularly the type "neg" - negation modifier. It is also available when using Stanford enhanced++ parser using the website. for example, the sentence:

"Barack Obama was not born in Hawaii"

enter image description here

The parser indeed find neg(born,not)

but when I'm using the stanfordnlp python library, the only dependency parser I can get will parse the sentence as follow:

('Barack', '5', 'nsubj:pass')

('Obama', '1', 'flat')

('was', '5', 'aux:pass')

('not', '5', 'advmod')

('born', '0', 'root')

('in', '7', 'case')

('Hawaii', '5', 'obl')

and the code that generates it:

import stanfordnlp
stanfordnlp.download('en')  
nlp = stanfordnlp.Pipeline()
doc = nlp("Barack Obama was not born in Hawaii")
a  = doc.sentences[0]
a.print_dependencies()

Is there a way to get similar results to the enhanced dependency parser or any other Stanford parser that result in typed dependencies that will give me the negation modifier?

312

asked Jun 10 '19 13:06

Latent

3 Answers

It is to note the python library stanfordnlp is not just a python wrapper for StanfordCoreNLP.

1. Difference StanfordNLP / CoreNLP

As said on the stanfordnlp Github repo:

The Stanford NLP Group's official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server.

Stanfordnlp contains a new set of neural networks models, trained on the CONLL 2018 shared task. The online parser is based on the CoreNLP 3.9.2 java library. Those are two different pipelines and sets of models, as explained here.

Your code only accesses their neural pipeline trained on CONLL 2018 data. This explains the differences you saw compared to the online version. Those are basically two different models.

What adds to the confusion I believe is that both repositories belong to the user named stanfordnlp (which is the team name). Don't be fooled between the java stanfordnlp/CoreNLP and the python stanfordnlp/stanfordnlp.

Concerning your 'neg' issue, it seems that in the python libabry stanfordnlp, they decided to consider the negation with an 'advmod' annotation altogether. At least that is what I ran into for a few example sentences.

2. Using CoreNLP via stanfordnlp package

However, you can still get access to the CoreNLP through the stanfordnlp package. It requires a few more steps, though. Citing the Github repo,

There are a few initial setup steps.

Download Stanford CoreNLP and models for the language you wish to use. (you can download CoreNLP and the language models here)

Put the model jars in the distribution folder

Tell the python code where Stanford CoreNLP is located: export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05

Once that is done, you can start a client, with code that can be found in the demo :

from stanfordnlp.server import CoreNLPClient 

with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    # get the dependency parse of the first sentence
    print('---')
    print('dependency parse of first sentence')
    dependency_parse = sentence.basicDependencies
    print(dependency_parse)

    #get the tokens of the first sentence
    #note that 1 token is 1 node in the parse tree, nodes start at 1
    print('---')
    print('Tokens of first sentence')
    for token in sentence.token :
        print(token)

Your sentence will therefore be parsed if you specify the 'depparse' annotator (as well as the prerequisite annotators tokenize, ssplit, and pos). Reading the demo, it feels that we can only access basicDependencies. I have not managed to make Enhanced++ dependencies work via stanfordnlp.

But the negations will still appear if you use basicDependencies !

Here is the output I obtained using stanfordnlp and your example sentence. It is a DependencyGraph object, not pretty, but it is unfortunately always the case when we use the very deep CoreNLP tools. You will see that between nodes 4 and 5 ('not' and 'born'), there is and edge 'neg'.

node {
  sentenceIndex: 0
  index: 1
}
node {
  sentenceIndex: 0
  index: 2
}
node {
  sentenceIndex: 0
  index: 3
}
node {
  sentenceIndex: 0
  index: 4
}
node {
  sentenceIndex: 0
  index: 5
}
node {
  sentenceIndex: 0
  index: 6
}
node {
  sentenceIndex: 0
  index: 7
}
node {
  sentenceIndex: 0
  index: 8
}
edge {
  source: 2
  target: 1
  dep: "compound"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 2
  dep: "nsubjpass"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 3
  dep: "auxpass"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 4
  dep: "neg"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 7
  dep: "nmod"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 8
  dep: "punct"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 7
  target: 6
  dep: "case"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
root: 5

---
Tokens of first sentence
word: "Barack"
pos: "NNP"
value: "Barack"
before: ""
after: " "
originalText: "Barack"
beginChar: 0
endChar: 6
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false

word: "Obama"
pos: "NNP"
value: "Obama"
before: " "
after: " "
originalText: "Obama"
beginChar: 7
endChar: 12
tokenBeginIndex: 1
tokenEndIndex: 2
hasXmlContext: false
isNewline: false

word: "was"
pos: "VBD"
value: "was"
before: " "
after: " "
originalText: "was"
beginChar: 13
endChar: 16
tokenBeginIndex: 2
tokenEndIndex: 3
hasXmlContext: false
isNewline: false

word: "not"
pos: "RB"
value: "not"
before: " "
after: " "
originalText: "not"
beginChar: 17
endChar: 20
tokenBeginIndex: 3
tokenEndIndex: 4
hasXmlContext: false
isNewline: false

word: "born"
pos: "VBN"
value: "born"
before: " "
after: " "
originalText: "born"
beginChar: 21
endChar: 25
tokenBeginIndex: 4
tokenEndIndex: 5
hasXmlContext: false
isNewline: false

word: "in"
pos: "IN"
value: "in"
before: " "
after: " "
originalText: "in"
beginChar: 26
endChar: 28
tokenBeginIndex: 5
tokenEndIndex: 6
hasXmlContext: false
isNewline: false

word: "Hawaii"
pos: "NNP"
value: "Hawaii"
before: " "
after: ""
originalText: "Hawaii"
beginChar: 29
endChar: 35
tokenBeginIndex: 6
tokenEndIndex: 7
hasXmlContext: false
isNewline: false

word: "."
pos: "."
value: "."
before: ""
after: ""
originalText: "."
beginChar: 35
endChar: 36
tokenBeginIndex: 7
tokenEndIndex: 8
hasXmlContext: false
isNewline: false

2. Using CoreNLP via NLTK package

I will not go into details on this one, but there is also a solution to access the CoreNLP server via the NLTK library , if all else fails. It does output the negations, but requires a little more work to start the servers. Details on this page

EDIT

I figured I could also share with you the code to get the DependencyGraph into a nice list of 'dependency, argument1, argument2' in a shape similar to what stanfordnlp outputs.

from stanfordnlp.server import CoreNLPClient

text = "Barack Obama was not born in Hawaii."

# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    # get the dependency parse of the first sentence
    dependency_parse = sentence.basicDependencies

    #print(dir(sentence.token[0])) #to find all the attributes and methods of a Token object
    #print(dir(dependency_parse)) #to find all the attributes and methods of a DependencyGraph object
    #print(dir(dependency_parse.edge))

    #get a dictionary associating each token/node with its label
    token_dict = {}
    for i in range(0, len(sentence.token)) :
        token_dict[sentence.token[i].tokenEndIndex] = sentence.token[i].word

    #get a list of the dependencies with the words they connect
    list_dep=[]
    for i in range(0, len(dependency_parse.edge)):

        source_node = dependency_parse.edge[i].source
        source_name = token_dict[source_node]

        target_node = dependency_parse.edge[i].target
        target_name = token_dict[target_node]

        dep = dependency_parse.edge[i].dep

        list_dep.append((dep, 
            str(source_node)+'-'+source_name, 
            str(target_node)+'-'+target_name))
    print(list_dep)

It ouputs the following

[('compound', '2-Obama', '1-Barack'), ('nsubjpass', '5-born', '2-Obama'), ('auxpass', '5-born', '3-was'), ('neg', '5-born', '4-not'), ('nmod', '5-born', '7-Hawaii'), ('punct', '5-born', '8-.'), ('case', '7-Hawaii', '6-in')]

118

answered Sep 21 '22 03:09

Cyrielle Mallart

# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    offset = 0 # keeps track of token offset for each sentence
    for sentence in ann.sentence:
        print('___________________')
        print('dependency parse:')
        # extract dependency parse
        dp = sentence.basicDependencies
        # build a helper dict to associate token index and label
        token_dict = {sentence.token[i].tokenEndIndex-offset : sentence.token[i].word for i in range(0, len(sentence.token))}
        offset += len(sentence.token)

        # build list of (source, target) pairs
        out_parse = [(dp.edge[i].source, dp.edge[i].target) for i in range(0, len(dp.edge))]

        for source, target in out_parse:
            print(source, token_dict[source], '->', target, token_dict[target])

        print('\nTokens \t POS \t NER')
        for token in sentence.token:
            print (token.word, '\t', token.pos, '\t', token.ner)

This outputs the following for the first sentence:

___________________
dependency parse:
2 Obama -> 1 Barack
4 born -> 2 Obama
4 born -> 3 was
4 born -> 6 Hawaii
4 born -> 7 .
6 Hawaii -> 5 in

Tokens   POS     NER
Barack   NNP     PERSON
Obama    NNP     PERSON
was      VBD     O
born     VBN     O
in       IN      O
Hawaii   NNP     STATE_OR_PROVINCE
.        .       O

answered Sep 22 '22 03:09

Mazidi

I believe there is likely a discrepancy between the model which was used to generate dependencies for documentation and the one that is available online hence the difference. I would raise the issue with stanfordnlp library maintainers directly via GitHub issues.

answered Sep 21 '22 03:09

sophros

Related questions
                            
                                Is `list()` considered a function?
                            
                                Cross tabulate counts between pairs of keywords per group with pandas
                            
                                Renaming columns in dask dataframe
                            
                                Pandas check time series continuity
                            
                                Python protobuf "from google.protobuf.pyext import _message" - "ImportError: DLL load failed: The specified procedure could not be found"
                            
                                Got The included URLconf does not appear to have any patterns in it
                            
                                Increase Font Size of Chart Title in Altair
                            
                                How to unset `sharex` or `sharey` from two axes in Matplotlib
                            
                                pass url parameter to serializer
                            
                                Importing CSV file into Google Colab using numpy loadtxt
                            
                                Filter a list of tuples based on condition
                            
                                Pandas read_sql() - AttributeError: 'Engine' object has no attribute 'cursor'
                            
                                How to deploy pyside2 applications? - The Qt way
                            
                                How to upload app bundle (.aab) to play store using Google Play Publisher API
                            
                                Pythonic conversion to singleton iterable if not already an iterable
                            
                                How do I fix AttributeError: 'bytes' object has no attribute 'encode'?
                            
                                pandas dataframe: how to aggregate a subset of rows based on value of a column
                            
                                Video Streaming from IP Camera in Python Using OpenCV cv2.VideoCapture
                            
                                Tensorflow/keras: "logits and labels must have the same first dimension" How to squeeze logits or expand labels?
                            
                                Import Error: 'scipy.misc import imsave' on Google Colaboratory

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With