I'm thinking about using word n-gram techniques on raw text. But I have a doubt:
does it make sense to use word n-grams after applying lemmatization/stemming to the text? If not, why should I use word n-grams only on raw files? What are the pros and cons?
N-grams are contiguous sequences of n words, symbols, or tokens in a document. In technical terms, they can be defined as the neighbouring sequences of items in a document. They come into play when we deal with text data in NLP (Natural Language Processing) tasks.
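As a minimal sketch of that definition, word n-grams can be produced by sliding a window of size n over the token list. The function name `word_ngrams` and the whitespace tokenization are assumptions for illustration; a real pipeline would likely use a proper tokenizer.

```python
def word_ngrams(text, n):
    """Return the word n-grams of `text` as a list of tuples.

    Tokenization here is a plain lowercase whitespace split (an assumption
    made to keep the example short).
    """
    tokens = text.lower().split()
    # Slide a window of n tokens across the list; a text shorter than
    # n tokens yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("the quick brown fox", 2)
print(bigrams)
```

For a four-word text this yields three bigrams, one for each pair of neighbouring words.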
Lemmatization versus stemming: stemming is a rule-based approach, whereas lemmatization is a dictionary-based approach that maps each word to its canonical form. Lemmatization is generally more accurate than stemming, but also slower. Lemmatization is preferred when the meaning of words in context matters, whereas stemming is recommended when the context is not important.
Stemming is a process that removes the last few characters from a word, often producing stems with an incorrect meaning or spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. For instance, a crude stemmer that simply strips "-ing" would reduce "caring" to "car", whereas lemmatization would return "care".
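The difference can be sketched with two toy functions. Both are hypothetical stand-ins written for this example: `naive_stem` is a crude suffix stripper (not the Porter algorithm or any library stemmer), and `toy_lemmatize` looks words up in a tiny hand-made dictionary, `LEMMAS`, instead of a real lexicon.

```python
def naive_stem(word):
    """Toy rule-based stemmer: strip the first matching common suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


# Hypothetical mini-dictionary standing in for a real lemma lexicon.
LEMMAS = {"caring": "care", "cars": "car", "records": "record"}


def toy_lemmatize(word):
    """Toy dictionary-based lemmatizer: look the word up, else keep it."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ("caring", "cars"):
    print(w, naive_stem(w), toy_lemmatize(w))
```

With these toy rules, the stemmer collapses both "caring" and "cars" to "car", while the dictionary lookup keeps them apart as "care" and "car". That is the kind of conflation that can later show up as a false positive.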
You would compute word n-grams after lemmatization or stemming for the same reasons you would compute them on unstemmed text; normalizing first simply means that inflected variants of a phrase map to the same n-gram. Sometimes this gets you false positives, e.g., (D3), but it usually increases recall enough that you want to do it.
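A small sketch of the recall effect, assuming a hypothetical one-line document and a toy normalizer, `strip_plural`, that stands in for a real stemmer. Nothing here reproduces the original example documents; only the query string "criminal records" comes from the thread.

```python
def strip_plural(token):
    """Toy normalizer (stand-in for a stemmer): drop a trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def bigram_set(text, normalize=None):
    """Return the set of word bigrams, optionally normalizing each token first."""
    tokens = text.lower().split()
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    # Pair each token with its right-hand neighbour.
    return set(zip(tokens, tokens[1:]))


query = "criminal records"
doc = "he has a criminal record"  # hypothetical document

# Without normalization the query bigram does not occur in the document.
raw_match = bool(bigram_set(query) & bigram_set(doc))

# After normalizing both sides, the bigram ('criminal', 'record') is shared.
stemmed_match = bool(bigram_set(query, strip_plural) & bigram_set(doc, strip_plural))

print(raw_match, stemmed_match)
```

The same normalization has to be applied to both the query and the documents; otherwise the n-grams cannot line up.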
In some domains, e.g., short text, stemming can hurt. The best thing to do is to test. In general, I would suggest stemming and case-folding, but it really depends on your domain and queries.
Q="criminal records"
It's a precision/recall tradeoff. You can increase recall by stemming (always) and you can increase precision by not stemming. But it depends on what kinds of queries you are serving. If you're running code search, for instance, you almost never want to stem or preprocess, because users expect to type in exact symbol names and then find them.