 

Why do I need a tokenizer for each language? [closed]

When processing text, why would one need a tokenizer specialized for the language?

Wouldn't tokenizing by whitespace be enough? In which cases is it not a good idea to use simple whitespace tokenization?

Asked Jun 26 '13 by Jack Twain


1 Answer

The question also implies "What is a word?", and the answer can be quite task-specific (even disregarding multilinguality as one parameter). Here is my attempt at a comprehensive answer:

(Missing) Spaces between words

Many languages do not put spaces between words at all, so the basic word-division algorithm of breaking on whitespace is of no use. Such languages include major East Asian languages/scripts, such as Chinese, Japanese, and Thai. Ancient Greek was also written without word spaces; spaces (together with accent marks, etc.) were only introduced later. In such languages, word segmentation is a much more major and challenging task. (MANNI:1999, p. 129)
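
As a minimal illustration (Python; the example sentences are mine, not from the book): whitespace splitting does a rough job on the English sentence but returns the unsegmented Chinese sentence as one single "token".

  # Whitespace tokenization vs. a language written without word spaces.
  english = "I bought a new phone yesterday"
  chinese = "我昨天买了一部新手机"   # same sentence in Chinese, no spaces between words

  print(english.split())   # ['I', 'bought', 'a', 'new', 'phone', 'yesterday']
  print(chinese.split())   # ['我昨天买了一部新手机']   (the whole sentence comes back as one token)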

Compounds

German compound nouns are written as a single word, e.g. "Kartellaufsichtsbehördenangestellter" (an employee at the "Anti-Trust agency"), and compounds are de facto single words, phonologically speaking (cf. (MANNI:1999, p. 120)). Their information density, however, is high, and one may wish to divide such a compound, or at least be aware of the internal structure of the word; this becomes a limited word segmentation task. (Ibidem)

There is also the special case of agglutinative languages, where prepositions, possessive pronouns, etc. are 'attached' to the 'main' word; in the European domain, e.g. Finnish, Hungarian, and Turkish.
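
To make the "limited word segmentation task" mentioned above concrete, here is a toy sketch of a greedy, dictionary-based decompounder in Python. The tiny lexicon, the longest-match strategy, and the handling of linking elements are my own simplifying assumptions; real decompounders use large lexicons, frequency models, and proper treatment of Fugenelemente.

  # Naive greedy decompounding sketch (illustrative only).
  LEXICON = {"kartell", "aufsicht", "behörde", "behörden", "angestellter"}  # toy dictionary
  LINKING = ("s", "n")  # common German linking elements (Fugenelemente)

  def split_compound(word, lexicon=LEXICON):
      word = word.lower()
      parts, i = [], 0
      while i < len(word):
          # longest dictionary entry starting at position i
          match = next((word[i:j] for j in range(len(word), i, -1)
                        if word[i:j] in lexicon), None)
          if match is None:
              return [word]  # give up: return the compound unsplit
          parts.append(match)
          i += len(match)
          # skip a linking element if no dictionary entry starts here
          if i < len(word) and word[i] in LINKING and not any(
                  word[i:j] in lexicon for j in range(len(word), i, -1)):
              i += 1
      return parts

  print(split_compound("Kartellaufsichtsbehördenangestellter"))
  # ['kartell', 'aufsicht', 'behörden', 'angestellter']  (the linking 's' is dropped)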

Variant styles and codings

Variant coding of information of a certain semantic type, e.g. local conventions for phone numbers, dates, etc.:

[...] Even if one is not dealing with multilingual text, any application dealing with text from different countries or written according to different stylistic conventions has to be prepared to deal with typographical differences. In particular, some items such as phone numbers are clearly of one semantic sort, but can appear in many formats. (MANNI:1999, p. 130)
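
One common way to keep such items together is a regex tokenizer whose "semantic" patterns (dates, phone numbers) are tried before the generic word pattern. The patterns below are rough illustrations I made up, covering only a couple of formats each, not a complete solution:

  import re

  # Ordered alternatives: dates and phone numbers first, then words, then single symbols.
  TOKEN_RE = re.compile(r"""
        \d{1,2}[./]\d{1,2}[./]\d{2,4}      # date like 26.06.2013 or 6/26/13
      | \+?\d[\d\s\-()]{6,}\d              # very rough phone number
      | \w+                                # plain words
      | [^\w\s]                            # any other single symbol
  """, re.VERBOSE)

  text = "Call +49 30 1234 5678 or 030-123-4567 before 26.06.2013, please."
  print(TOKEN_RE.findall(text))
  # ['Call', '+49 30 1234 5678', 'or', '030-123-4567', 'before', '26.06.2013', ',', 'please', '.']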

Misc.

One major task is the disambiguation of periods (and punctuation in general) and other non-alpha(numeric) symbols: if, for example, a period is part of the word, keep it that way, so that we can distinguish Wash., an abbreviation for the state of Washington, from the capitalized form of the verb wash (MANNI:1999, p. 129). Besides cases like this, the handling of contractions and hyphenation also cannot be treated as a standard case across languages (even disregarding the missing whitespace separator).
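
A minimal sketch of abbreviation-aware period handling; the abbreviation list is a toy stand-in (real tokenizers learn or ship far larger lists):

  import re

  # Keep the period attached when the token is a known abbreviation,
  # otherwise split trailing punctuation off as its own token.
  ABBREVIATIONS = {"Wash.", "Mr.", "Dr.", "etc."}   # toy list, illustrative only

  def tokenize(text):
      tokens = []
      for tok in text.split():
          if tok in ABBREVIATIONS:
              tokens.append(tok)                    # 'Wash.' stays one token
          else:
              tokens.extend(t for t in re.split(r"([.,!?])$", tok) if t)
      return tokens

  print(tokenize("Wash. is a state."))       # ['Wash.', 'is', 'a', 'state', '.']
  print(tokenize("Please wash. Then dry."))  # ['Please', 'wash', '.', 'Then', 'dry', '.']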

If one wants to handle multilingual contractions/clitics (a sketch for the English case follows this list):

  • English: They're my father's cousins.
  • French: Montrez-le à l'agent!
  • German: Ich hab's ins Haus gebracht. (in's is still a valid variant)
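
For the English case, a minimal sketch that splits clitics such as 're and 's off their host word (the clitic list is my assumption; French l' and German hab's would need their own rules):

  import re

  # Split English clitics off the host word: "They're" -> "They", "'re".
  CLITIC_RE = re.compile(r"(?<=\w)('re|'s|'ve|'ll|'d|n't)\b", re.IGNORECASE)

  def split_clitics(text):
      # insert a space before each clitic, then fall back to whitespace splitting
      return CLITIC_RE.sub(r" \1", text).split()

  print(split_clitics("They're my father's cousins."))
  # ['They', "'re", 'my', 'father', "'s", 'cousins.']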

Since tokenization and sentence segmentation go hand in hand, they share the same (cross-language) problems. For anyone who is concerned or wants a general direction (a minimal usage sketch follows these references):

  • Kiss, Tibor and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), p. 485-525.
  • Palmer, D. and M. Hearst. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics, 23(2), p. 241-267.
  • Reynar, J. and A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. Proceedings of the Fifth Conference on Applied Natural Language Processing, p. 16-19.
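
The Kiss & Strunk (2006) approach is, as far as I know, what NLTK ships as its pretrained Punkt model; a minimal usage sketch (exact outputs depend on the model version):

  import nltk
  from nltk.tokenize import sent_tokenize, word_tokenize

  nltk.download("punkt")   # pretrained Punkt model; newer NLTK releases may need "punkt_tab"

  text = "Mr. Smith went to Washington. He arrived on Jan. 5."
  print(sent_tokenize(text))
  # typically: ['Mr. Smith went to Washington.', 'He arrived on Jan. 5.']

  print(word_tokenize("They're my father's cousins."))
  # typically: ['They', "'re", 'my', 'father', "'s", 'cousins', '.']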

References

(MANNI:1999) Manning, Ch. D. and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.

Answered Sep 20 '22 by Nino