Using the Python package spaCy, how can one detect whether a sentence uses a passive or active voice? For example, the following sentences should be detected as using a passive and active voice respectively:
passive_sentence = "John was accused of committing crimes by David"
# passive voice "John was accused"
active_sentence = "David accused John of committing crimes"
# active voice "David accused John"
The following solution employs spaCy's rule-based matching engine to detect and display the parts of a sentence that use the active or passive voice. No method is going to correctly identify 100% of sentences, especially those that are more complex, however, the solution below handles the vast majority of cases and can likely be improved to handle more edge cases.
The key components are the rules you provide to the matcher
. I'll explain one of the passive voice rules below---if you understand one, you should be able to understand all the other rules and begin to construct your own rules to match particular patterns using the spaCy token-based matching documentation. Consider the following passive voice rule:
[{'DEP': 'nsubjpass'}, {'DEP': 'aux', 'OP': '*'}, {'DEP': 'auxpass'}, {'TAG': 'VBN'}]
This rule/pattern is used by the matcher
to find a sequential combination of tokens. Specifically, the matcher
will:
OP
" stands for "operator", which defines how often a token pattern should be matched. See the operators and quantifiers subsection of the spaCy documentation for more information.If you are unfamiliar with Part of Speech (PoS) tags, please see this tutorial. Additionally, in-depth explanations of the dependency labels and what they mean are provided on the Universal Dependencies (UD) dependency documentation page.
import spacy
from spacy.matcher import Matcher
passive_sentences = [
"John was accused of committing crimes by David.",
"She was sent a cheque for a thousand euros.",
"He was given a book for his birthday.",
"He will be sent away to school.",
"The meeting was called off.",
"He was looked after by his grandmother.",
]
active_sentences = [
"David accused John of committing crimes.",
"Someone sent her a cheque for a thousand euros.",
"I gave him a book for his birthday.",
"They will send him away to school.",
"They called off the meeting.",
"His grandmother looked after him."
]
composite_sentences = [
"Three men seized me, and I was carried to the car."
]
# Load spaCy pipeline (model)
nlp = spacy.load('en_core_web_trf')
# Create pattern to match passive voice use
passive_rules = [
[{'DEP': 'nsubjpass'}, {'DEP': 'aux', 'OP': '*'}, {'DEP': 'auxpass'}, {'TAG': 'VBN'}],
[{'DEP': 'nsubjpass'}, {'DEP': 'aux', 'OP': '*'}, {'DEP': 'auxpass'}, {'TAG': 'VBZ'}],
[{'DEP': 'nsubjpass'}, {'DEP': 'aux', 'OP': '*'}, {'DEP': 'auxpass'}, {'TAG': 'RB'}, {'TAG': 'VBN'}],
]
# Create pattern to match active voice use
active_rules = [
[{'DEP': 'nsubj'}, {'TAG': 'VBD', 'DEP': 'ROOT'}],
[{'DEP': 'nsubj'}, {'TAG': 'VBP'}, {'TAG': 'VBG', 'OP': '!'}],
[{'DEP': 'nsubj'}, {'DEP': 'aux', 'OP': '*'}, {'TAG': 'VB'}],
[{'DEP': 'nsubj'}, {'DEP': 'aux', 'OP': '*'}, {'TAG': 'VBG'}],
[{'DEP': 'nsubj'}, {'TAG': 'RB', 'OP': '*'}, {'TAG': 'VBG'}],
[{'DEP': 'nsubj'}, {'TAG': 'RB', 'OP': '*'}, {'TAG': 'VBZ'}],
[{'DEP': 'nsubj'}, {'TAG': 'RB', 'OP': '+'}, {'TAG': 'VBD'}],
]
matcher = Matcher(nlp.vocab) # Init. the matcher with a vocab (note matcher vocab must share same vocab with docs)
matcher.add('Passive', passive_rules) # Add passive rules to matcher
matcher.add('Active', active_rules) # Add active rules to matcher
text = passive_sentences + active_sentences + composite_sentences # Combine various passive/active sentences
for sentence in text:
doc = nlp(sentence) # Process text with spaCy model
matches = matcher(doc) # Get matches
print("-"*40 + "\n" + sentence)
if len(matches) > 0:
for match_id, start, end in matches:
string_id = nlp.vocab.strings[match_id]
span = doc[start:end] # the matched span
print("\t{}: {}".format(string_id, span.text))
else:
print("\tNo active or passive voice detected.")
----------------------------------------
John was accused of committing crimes by David.
Passive: John was accused
----------------------------------------
She was sent a cheque for a thousand euros.
Passive: She was sent
----------------------------------------
He was given a book for his birthday.
Passive: He was given
----------------------------------------
He will be sent away to school.
Passive: He will be sent
----------------------------------------
The meeting was called off.
Passive: meeting was called
----------------------------------------
He was looked after by his grandmother
Passive: He was looked
----------------------------------------
David accused John of committing crimes.
Active: David accused
----------------------------------------
Someone sent her a cheque for a thousand euros.
Active: Someone sent
----------------------------------------
I gave him a book for his birthday.
Active: I gave
----------------------------------------
They will send him away to school.
Active: They will send
----------------------------------------
They called off the meeting.
Active: They called
----------------------------------------
His grandmother looked after him..
Active: grandmother looked
----------------------------------------
Three men seized me, and I was carried to the car.
Active: men seized
Passive: I was carried
There is no easy solution for this. If you're looking for something simple, accuracy might take a hit. There is a wealth of info about NLP detecting passive and active voice in a text, proprietary algorithms being the most accurate, but they come at a cost.
What you're looking for, if it's for a custom hobby project, could have a quick solution trying out something like this, but if you follow the comments, you'll notice even here the accuracy is not in the double or even single 9 percentage rates.
You'll have to go more complex for higher accuracy, but don't expect double 9.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With