Is there a way to write a rule-based system to catch things like start/end dates from contract text? Here are a few real examples. The date entities I want spaCy to detect automatically are the start and end dates in each one. Ideas that use something other than spaCy are also OK!
The initial term of this Lease shall be for a period of Five (5) years commencing on February 1, 2012, (the “Lease Commencement Date”) and expiring on January 31, 2017 (the “Initial Lease Term”).
Term: One (1) year commencing January 1, 2007 ("Commencement Date") and ending December 31, 2007 ("Expiration Date").
This Lease Agreement is entered into for term of 15 years, beginning January 1, 2014 and ending on December 31, 2028.
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps, which together are referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
The entity ruler lets you add spans to Doc.ents using token-based rules or exact phrase matches. It can be combined with the statistical EntityRecognizer to boost accuracy, or used on its own to implement a purely rule-based entity recognition system.
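As a rough illustration of the purely rule-based route, here is a minimal sketch of an entity ruler that tags dates written like "January 1, 2007". It assumes spaCy v3 and a blank English pipeline (tokenizer only, so no trained model is needed). The `MONTHS` list and the single token pattern are illustrative assumptions, not a complete date grammar.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative assumption: only full English month names are handled.
MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]

# One token pattern for dates such as "February 1, 2012".
patterns = [
    {
        "label": "DATE",
        "pattern": [
            {"LOWER": {"IN": MONTHS}},   # month name
            {"IS_DIGIT": True},          # day of month
            {"ORTH": ",", "OP": "?"},    # optional comma
            {"SHAPE": "dddd"},           # four-digit year
        ],
    }
]
ruler.add_patterns(patterns)

text = (
    'Term: One (1) year commencing January 1, 2007 ("Commencement Date") '
    'and ending December 31, 2007 ("Expiration Date").'
)
doc = nlp(text)
dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
print(dates)
```

Other date styles (abbreviated months, "1st of January", numeric dates) would each need their own pattern added to `patterns`.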
The spaCy NER system uses a word embedding strategy based on subword features and "Bloom" embeddings, together with a deep convolutional neural network with residual connections. The system is designed to give a good balance of efficiency, accuracy and adaptability.
Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationships between headwords and their dependents. The head of a sentence has no dependency and is called the root of the sentence.
I think you have to make a clear distinction between two types of methods:
1) Statistical models / machine learning, a.k.a. NER models. These take the context of the sentence into account when trying to figure out whether a specific token, or a run of consecutive tokens, is a date. spaCy has pre-built NER models you can download to try out on your specific data. You'll want to look for those entities (in doc.ents) that have ent.label_ == "DATE". Once you have those entities, you can run them through a date parser to understand what the actual date is. See also here for more information.
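The "run them through a date parser" step could look something like the sketch below. It uses only the standard library. `parse_date` and `DATE_FORMATS` are hypothetical names introduced for illustration, and the list of formats is an assumption based on the example contracts. It also assumes English month names (the default C locale). The input strings stand in for the text of entities labelled DATE.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats, based on the contract examples; extend as needed.
DATE_FORMATS = ("%B %d, %Y", "%B %d %Y", "%m/%d/%Y")


def parse_date(text: str) -> Optional[date]:
    """Return a date for the entity text, or None if no format matches."""
    cleaned = " ".join(text.split())  # collapse line breaks and extra spaces
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


# Stand-ins for ent.text values of entities labelled DATE.
entity_texts = ["February 1, 2012", "January 31, 2017", "Five (5) years"]
parsed = [parse_date(t) for t in entity_texts]
print(parsed)
```

Returning `None` instead of raising matters here because DATE entities can also be relative dates or periods (such as a lease duration), which do not correspond to a single calendar date.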
2) Rule-based entity recognition. Here, you have to define the rules yourself by specifying what you expect your dates to look like, e.g. XX/XX/XXXX with X being a digit. As user1558604 pointed out though, you'll have to write multiple different rules if you want to recognize different representations of dates. You can find an overview of spaCy's rule-based matching methods here.
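Since the question is open to approaches outside spaCy, here is a minimal plain-regex sketch of the rule-based idea, tailored to the three lease examples. The cue words (commencing, beginning, expiring, ending), the month-name date format, and the `extract_term` helper are all assumptions made for illustration. Real contracts will need more rules than this.

```python
import re
from datetime import date, datetime
from typing import Dict

MONTH = (
    r"(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)"
)
# A cue word, an optional "on", then a date like "February 1, 2012".
TERM_RE = re.compile(
    rf"(?P<cue>commencing|beginning|expiring|ending)\s+(?:on\s+)?"
    rf"(?P<date>{MONTH}\s+\d{{1,2}},?\s+\d{{4}})",
    re.IGNORECASE,
)
START_CUES = {"commencing", "beginning"}


def extract_term(text: str) -> Dict[str, date]:
    """Map 'start' and 'end' to the first date found after each kind of cue."""
    result: Dict[str, date] = {}
    for match in TERM_RE.finditer(text):
        role = "start" if match.group("cue").lower() in START_CUES else "end"
        # Normalise "February 1, 2012" to "February 1 2012" before parsing.
        raw = " ".join(match.group("date").replace(",", " ").split())
        result.setdefault(role, datetime.strptime(raw, "%B %d %Y").date())
    return result


examples = [
    "The initial term of this Lease shall be for a period of Five (5) years "
    "commencing on February 1, 2012, (the “Lease Commencement Date”) and "
    "expiring on January 31, 2017 (the “Initial Lease Term”).",
    'Term: One (1) year commencing January 1, 2007 ("Commencement Date") '
    'and ending December 31, 2007 ("Expiration Date").',
    "This Lease Agreement is entered into for term of 15 years, beginning "
    "January 1, 2014 and ending on December 31, 2028.",
]
for example in examples:
    print(extract_term(example))
```

Tying each date to a nearby cue word is what separates a start date from an end date. A bare date pattern would find both but could not tell them apart.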