
Bytes vs Characters vs Words - which granularity for n-grams?

At least three types of n-grams can be considered for representing text documents (the sketch after this list makes each concrete):

  • byte-level n-grams
  • character-level n-grams
  • word-level n-grams
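
To make the three granularities concrete, here is a minimal plain-Python sketch; the ngrams helper and the sample sentence are my own, not part of the question:

    # Extract n-grams at byte, character, and word granularity from one sentence.
    def ngrams(seq, n):
        """Return all contiguous n-grams of a sequence as tuples."""
        return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

    text = "Mary loves dogs"

    byte_3grams = ngrams(text.encode("utf-8"), 3)  # operates on raw bytes (ints)
    char_3grams = ngrams(text, 3)                  # operates on Unicode characters
    word_2grams = ngrams(text.split(), 2)          # operates on whitespace tokens

    print(char_3grams[:3])  # [('M', 'a', 'r'), ('a', 'r', 'y'), ('r', 'y', ' ')]
    print(word_2grams)      # [('Mary', 'loves'), ('loves', 'dogs')]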

It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs".
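
As a quick check of that intuition (my own toy comparison), most character 3-grams of the two strings still coincide, while the word sets diverge on the misspelled token:

    # Character 3-grams tolerate the typo; exact word matching does not.
    def char_ngrams(text, n=3):
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    a, b = "Mary loves dogs", "Mary lpves dogs"

    shared = char_ngrams(a) & char_ngrams(b)
    union = char_ngrams(a) | char_ngrams(b)
    print(len(shared) / len(union))         # Jaccard similarity 0.625: still clearly related
    print(set(a.split()) & set(b.split()))  # only 'Mary' and 'dogs' survive; 'loves' != 'lpves'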

Are there other criteria to consider for choosing the "right" representation?

asked Feb 09 '14 by usual me

2 Answers

Evaluate. The criterion for choosing the representation is whatever works.

Indeed, character level (!= bytes, unless you only care about English) is probably the most common representation, because it is robust to spelling differences, which do not need to be errors; if you look at history, spelling changes. So for spelling-correction purposes, this works well.

On the other hand, the Google Books Ngram Viewer uses word-level n-grams on its books corpus, because the goal there is not to analyze spelling but term usage over time; e.g. "child care", where the individual words are not as interesting as their combination. Word-level n-grams have also been shown to be very useful in machine translation, in what is often referred to as the "refrigerator magnet" model.

If you are not processing international text, bytes may be a meaningful representation, too.
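
To act on the "whatever works" advice, one minimal sketch is to cross-validate the same classifier with character-level versus word-level n-gram features and keep the winner. This assumes scikit-learn; the toy corpus, labels, and parameter ranges are placeholders, not recommendations:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts  = ["Mary loves dogs", "Mary lpves dogs", "stocks fell sharply", "shares dropped today"] * 10
    labels = ["pets", "pets", "finance", "finance"] * 10

    for analyzer, ngram_range in [("char_wb", (2, 4)), ("word", (1, 2))]:
        model = make_pipeline(
            CountVectorizer(analyzer=analyzer, ngram_range=ngram_range),
            MultinomialNB(),
        )
        score = cross_val_score(model, texts, labels, cv=5).mean()
        print(analyzer, ngram_range, round(score, 3))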

answered Oct 08 '22 by Has QUIT--Anony-Mousse


I would outright discard byte-level n-grams for text-related tasks, because bytes are not a meaningful representation of anything.

Of the two remaining levels, character-level n-grams need much less storage space but, consequently, hold much less information. They are usually used in tasks such as language identification, writer identification (i.e. fingerprinting), and anomaly detection.
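
As an illustration of the first of those tasks, here is a toy character-n-gram language identifier; the two training snippets and the overlap score are my own simplification (real systems compare much larger profiles):

    from collections import Counter

    def profile(text, n=3):
        """Character n-gram frequency profile of a text."""
        text = text.lower()
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    training = {
        "english": profile("the quick brown fox jumps over the lazy dog and the cat"),
        "german":  profile("der schnelle braune fuchs springt ueber den faulen hund und die katze"),
    }

    def guess(text):
        p = profile(text)
        # Score each language by how much of the text's n-gram mass its profile shares.
        return max(training, key=lambda lang: sum(min(c, training[lang][g]) for g, c in p.items()))

    print(guess("the dog sleeps"))      # english
    print(guess("die katze schlaeft"))  # german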

As for word-level n-grams, they can serve the same purposes, and many more, but they need much more storage. For instance, you may need several gigabytes to represent a useful subset of English word 3-grams in memory (for general-purpose tasks). Yet, if the set of texts you have to work with is limited, word-level n-grams may not require that much storage.

As for the issue of errors, a sufficiently large word n-gram corpus will also include and represent them. Besides, there are various smoothing methods to deal with sparsity, as sketched below.
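
For instance, a minimal sketch of one such smoothing method, add-one (Laplace) smoothing for word bigrams, with a placeholder corpus; production systems typically use more refined schemes such as Kneser-Ney:

    from collections import Counter

    corpus = "mary loves dogs and mary loves cats".split()
    vocab = set(corpus)

    bigram_counts  = Counter(zip(corpus, corpus[1:]))
    unigram_counts = Counter(corpus)

    def p_laplace(w_prev, w):
        """P(w | w_prev) with add-one smoothing; unseen bigrams keep a small non-zero mass."""
        return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + len(vocab))

    print(p_laplace("loves", "dogs"))  # seen bigram
    print(p_laplace("dogs", "mary"))   # unseen bigram, but still > 0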

The other issue with n-grams is that they will almost never be able to capture the whole needed context, so they can only approximate it.

You can read more about n-grams in the classic Foundations of Statistical Natural Language Processing.

answered Oct 08 '22 by Vsevolod Dyomkin