How can I get my regular expression to match against just one condition exactly?
For example I have the following regular expression:
(\w+)(?=ly|es|s|y)
Matching the expression against the word "glasses" returns:
glasse
The correct match should be:
glass (match should be on 'es' rather than 's' as in the match above)
The expression should cater for any kinds of words such as:
films
lovely
glasses
glass
Currently the regular expression is matching the above words as:
film - correct
lovel - incorrect
glasse - incorrect
glas - incorrect
The correct match for the words should be:
film
love
glass
glass
The problem I am having at the moment is that I am not sure how to adjust my regular expression to cater for either 's' or 'es' exactly, as a word could contain both, such as "glasses".
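For reference, the behaviour described above can be reproduced with a short script. This is only a minimal sketch: the question does not name a regex flavour, so Python's re module is an assumption, and the helper name stem_greedy is made up for illustration.

```python
import re

# The pattern from the question: a greedy \w+ followed by a suffix lookahead.
GREEDY = re.compile(r"(\w+)(?=ly|es|s|y)")

def stem_greedy(word):
    """Return the captured group, or None when nothing matches."""
    m = GREEDY.search(word)
    return m.group(1) if m else None

for word in ("films", "lovely", "glasses", "glass"):
    print(word, "->", stem_greedy(word))
# films -> film
# lovely -> lovel
# glasses -> glasse
# glass -> glas
```

The greedy \w+ first swallows the whole word and then backtracks one character at a time, so it stops as soon as any suffix fits, which is always the shortest one at the end.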
Thank you for the answers so far. I appreciate the complexity of stemming and the requirement of language knowledge. However, in my particular case the set of words is finite (films, lovely, glasses and glass), so I will only ever encounter these words and the suffixes in the expression above. I don't have a particular application for this; I was just curious to see whether it was possible using regular expressions. I have come to the conclusion that it is not possible. However, would the following be possible:
A match is either found or not found, for example match glasses but NOT glass but DO match films:
film (match) - (films)
glass (match) - (glasses)
glass (no match) - (glass)
What I'm thinking is whether there is a way to match the suffix exactly against the string from the end. In the example above, 'es' matches glass(es), so the condition 's' is discarded. In the case of glass (no match), the condition 's' is discarded because another 's' precedes it, so it does not match exactly. I must admit I'm not 100% sure about this, so my logic may seem a little shaky; it's just an idea.
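The idea can be written down without a regular expression at all, which may make the intended logic easier to check. The following is a rough sketch in Python under the assumptions stated in the question (only these four suffixes, tried against the end of the word); the names SUFFIXES and strip_suffix are hypothetical.

```python
# Suffixes in priority order, so that 'es' is tried before the bare 's'.
SUFFIXES = ("ly", "es", "s", "y")

def strip_suffix(word):
    """Return the word without its suffix, or None for 'no match'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Discard a bare 's' when another 's' precedes it (e.g. 'glass').
            if suffix == "s" and stem.endswith("s"):
                return None
            return stem
    return None

for word in ("films", "glasses", "glass"):
    print(word, "->", strip_suffix(word))
# films -> film
# glasses -> glass
# glass -> None
```

Because the first suffix that fits decides the outcome, 'es' wins for "glasses" and the 's' rule is never consulted for that word.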
If you want to do stemming, use a library like Snowball. It's going to be impossible to do what you want to do with regular expressions. In particular, it will be impossible for your regex to know that the trailing 's' should be removed from 'films' but not 'glass' without some kind of knowledge of the language.
There's vast literature on stemming and lemmatization. Google is your friend.
The basic problem you're having here is that the plus in
(\w+)(?=ly|es|s|y)
is greedy, and will grab as much as possible while still allowing the whole regex to match. You've not said exactly which flavour of regex you're using, but try
(\w+?)(?=(?:ly|es|s|y)\b)
+? means the same as + but is reluctant, matching as little as possible while still allowing the overall match to succeed. The \b after the suffix group ties the lookahead to the end of the word; without it the reluctant quantifier stops at the first position that is followed by any suffix, so glasses would match only gla.
However, this would still have the problem that it splits glass into glas and s. To handle this you'd need something like
(\w+?)(?=(?:ly|es|(?<!s)s|y)\b)
using negative lookbehind to prevent the s alternative from matching when it is preceded by another s. With this pattern glass produces no match at all.
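The difference between the reluctant pattern and the lookbehind variant can be checked with a few lines of code. This is a sketch using Python's re module, which is an assumption since no flavour was named. Both patterns here carry a \b after the suffix group so that the lookahead only succeeds at the end of the word, and the names RELUCTANT, GUARDED and stem are illustrative.

```python
import re

# Reluctant quantifier, suffix lookahead tied to the end of the word by \b.
RELUCTANT = re.compile(r"(\w+?)(?=(?:ly|es|s|y)\b)")
# Same, but a bare 's' only counts when it is not preceded by another 's'.
GUARDED = re.compile(r"(\w+?)(?=(?:ly|es|(?<!s)s|y)\b)")

def stem(pattern, word):
    """Return the captured group, or None when the pattern does not match."""
    m = pattern.search(word)
    return m.group(1) if m else None

for word in ("films", "lovely", "glasses", "glass"):
    print(word, stem(RELUCTANT, word), stem(GUARDED, word))
# films film film
# lovely love love
# glasses glass glass
# glass glas None
```

Note that the guarded pattern reports "no match" for glass rather than returning the word unchanged, so a caller that wants glass back would need to fall back to the original word when the result is None.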