Compute word n-grams on original text or after lemma/stemming process?

I'm thinking about using word n-grams on raw text, but I have a doubt:

Does it make sense to compute word n-grams after lemmatizing or stemming the text? If not, why should I compute word n-grams only on the raw text? What are the pros and cons?

Alessandro asked Nov 10 '17 09:11



1 Answer

You would compute word n-grams after lemmatization or stemming for the same reason you would stem single words: so that different surface forms of the same phrase match. Sometimes this gets you false positives (e.g., D3 below), but it usually increases recall enough that it is worth doing.

In some domains, e.g., short text, stemming can hurt. The best thing to do is to test. In general I would suggest stemming and case-folding, but it really depends on your domain and queries.

Q="criminal records"

  • D1 = "... has a criminal record ..." (match on stem)
  • D2 = "... released the criminal records ..." (match normally)
  • D3 = "... while working on 'Smooth Criminal', recording ..." (false match on stem)
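The three cases above can be reproduced with a minimal sketch. It assumes a toy suffix-stripping function (`toy_stem`, a hypothetical helper, not a real stemmer such as Porter) and simple alphanumeric tokenization with case-folding; a real pipeline would likely use a library stemmer or lemmatizer, and results may differ.

```python
import re

# Toy rule set for illustration only; NOT a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def word_ngrams(text, n=2, stem=False):
    """Case-fold, tokenize on alphanumerics, optionally stem, return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def matches(query, doc, stem):
    """True if any query bigram also occurs in the document."""
    return bool(word_ngrams(query, stem=stem) & word_ngrams(doc, stem=stem))


QUERY = "criminal records"
DOCS = {
    "D1": "... has a criminal record ...",
    "D2": "... released the criminal records ...",
    "D3": "... while working on 'Smooth Criminal', recording ...",
}

for name, doc in DOCS.items():
    print(name, "raw:", matches(QUERY, doc, stem=False),
          "stemmed:", matches(QUERY, doc, stem=True))
```

On raw bigrams only D2 matches; on stemmed bigrams all three match, including the unwanted D3.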

It's a precision/recall tradeoff: stemming increases recall, and not stemming increases precision. Which side to favor depends on what kinds of queries you are serving. If you're running code search, for instance, you almost never want to stem or preprocess, because users expect to type in exact symbol names and then find them.
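To put numbers on the tradeoff for the three-document example, here is a small sketch. The retrieved sets are taken from the match labels in the list above (D2 matches normally; D1 and D3 match only on stem), and it assumes D1 and D2 are the relevant documents. The names `RELEVANT`, `RETRIEVED`, and `precision_recall` are illustrative, not from any library.

```python
# Assumption: D1 and D2 are relevant to the query; D3 is the false match.
RELEVANT = {"D1", "D2"}

# Retrieved sets per setting, taken from the labels in the example list.
RETRIEVED = {
    "raw": {"D2"},
    "stemmed": {"D1", "D2", "D3"},
}


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one retrieved set against the relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


for setting, retrieved in RETRIEVED.items():
    p, r = precision_recall(retrieved, RELEVANT)
    print(f"{setting}: precision={p:.2f} recall={r:.2f}")
```

Without stemming, precision is perfect but half the relevant documents are missed; with stemming, everything relevant is found at the cost of one false positive. On a real collection you would run the same comparison over a labeled query set before deciding.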

John Foley answered Sep 30 '22 08:09