When processing text, why would one need a tokenizer specialized for the language?
Wouldn't tokenizing by whitespace be enough? What are the cases where it is not a good idea to simply use whitespace tokenization?
Tokenization is the process of breaking raw text into small chunks, such as words or sentences, called tokens. These tokens help in understanding the context and in developing models for NLP, since the meaning of the text can be interpreted by analyzing the sequence of tokens.
The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
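To make the limits of pure whitespace splitting concrete, here is a minimal Python illustration (the sentence is just an example): punctuation and contractions stay glued to the neighbouring word.

```python
text = "Tokenization isn't trivial: punctuation sticks to words!"

# Naive whitespace tokenization keeps punctuation attached to the words.
tokens = text.split()
print(tokens)
# ['Tokenization', "isn't", 'trivial:', 'punctuation', 'sticks', 'to', 'words!']
```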
The question also implies "What is a word?" and can be quite task-specific (even disregarding multilinguality as one parameter). Here's my attempt at a comprehensive answer:
(Missing) Spaces between words
Many languages do not put spaces between words at all, so the basic word-division algorithm of breaking on whitespace is of no use. Such languages include major East Asian languages/scripts, such as Chinese, Japanese, and Thai. Ancient Greek was also written without word spaces; spaces (together with accent marks, etc.) were introduced by later editors. In such languages, word segmentation is a much more major and challenging task. (MANNI:1999, p. 129)
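As a rough sketch of the problem (the Chinese sentence is just an illustration, and the jieba segmenter is only one of many possible tools, not something the cited text prescribes):

```python
text = "我喜欢自然语言处理"  # "I like natural language processing"

# Splitting on whitespace yields a single "token": there are no spaces to split on.
print(text.split())  # ['我喜欢自然语言处理']

# A dedicated word segmenter is needed instead, e.g. the jieba package:
# import jieba
# print(list(jieba.cut(text)))  # e.g. ['我', '喜欢', '自然语言', '处理']
```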
Compounds
German compound nouns are written as a single word, e.g. "Kartellaufsichtsbehördenangestellter" (an employee at the "Anti-Trust agency"), and compounds de facto are single words -- phonologically (cf. (MANNI:1999, p. 120)). Their information density, however, is high, and one may wish to divide such a compound, or at least be aware of the internal structure of the word, and this becomes a limited word segmentation task. (ibid.)
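As a toy sketch of such a limited segmentation task, one can decompound greedily against a word list; the tiny lexicon below is a hand-made assumption purely for illustration (a real decompounder needs a full lexicon and rules for linking elements such as the German -s-):

```python
# Toy greedy decompounder: repeatedly strip the longest known word from the front.
LEXICON = {"kartell", "aufsicht", "aufsichts", "behörde", "behörden", "angestellter"}

def decompound(word, lexicon=LEXICON):
    word = word.lower()
    parts = []
    while word:
        # Try the longest prefix that is a known lexicon entry.
        for end in range(len(word), 0, -1):
            if word[:end] in lexicon:
                parts.append(word[:end])
                word = word[end:]
                break
        else:
            # No known prefix: give up and keep the rest as one chunk.
            parts.append(word)
            break
    return parts

print(decompound("Kartellaufsichtsbehördenangestellter"))
# ['kartell', 'aufsichts', 'behörden', 'angestellter']
```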
There is also the special case of agglutinative languages, where prepositions, possessive pronouns, etc. are 'attached' to the 'main' word; e.g. Finnish, Hungarian, and Turkish among European languages.
Variant styles and codings
Variant coding of information of a certain semantic type, e.g. local syntax for phone numbers, dates, etc.:
[...] Even if one is not dealing with multilingual text, any application dealing with text from different countries or written according to different stylistic conventions has to be prepared to deal with typographical differences. In particular, some items such as phone numbers are clearly of one semantic sort, but can appear in many formats. (MANNI:1999, p. 130)
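A rough sketch of what "one semantic sort, many formats" means for a tokenizer: a rule (here a deliberately simplified, US-style regular expression; real phone-number syntax varies much more by country) recognizes the variants and maps them to a single token type.

```python
import re

# Deliberately simplified, US-style pattern; real phone-number syntax varies
# by country, so a production tokenizer needs locale-specific rules.
PHONE = re.compile(r"(?:\+\d{1,2}[ .-]?)?(?:\(\d{3}\)[ .-]?|\d{3}[ .-])\d{3}[ .-]\d{4}")

samples = [
    "Call (212) 555-0123 tomorrow.",
    "Call 212.555.0123 tomorrow.",
    "Call +1 212 555 0123 tomorrow.",
]
for s in samples:
    match = PHONE.search(s).group()
    # One semantic type, many surface forms: keep each as a single token.
    print(match, "->", re.sub(r"\D", "", match))
```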
Misc.
One major task is the disambiguation of periods (and punctuation in general) and other non-alphanumeric symbols: if, e.g., a period is part of the word, keep it that way, so we can distinguish Wash., an abbreviation for the state of Washington, from the capitalized form of the verb wash (MANNI:1999, p. 129). Beyond cases like this, handling contractions and hyphenation also cannot be treated as a standard case across languages (even disregarding the missing whitespace separator).
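A minimal sketch of the Wash. case, assuming a hand-made abbreviation list (real tokenizers use larger lexicons and/or statistical disambiguation):

```python
import re

# Hand-made abbreviation list, assumed purely for illustration.
ABBREVIATIONS = {"wash.", "mr.", "dr.", "etc."}

def tokenize(text):
    tokens = []
    for raw in text.split():
        if raw.lower() in ABBREVIATIONS:
            tokens.append(raw)  # the period stays part of the word
        else:
            # Otherwise split off punctuation as separate tokens.
            tokens.extend(re.findall(r"\w+|[^\w\s]", raw))
    return tokens

print(tokenize("He moved to Wash. last year."))
# ['He', 'moved', 'to', 'Wash.', 'last', 'year', '.']
```

Note that a pure list lookup still cannot tell a sentence-final wash. from the abbreviation Wash., which is exactly the ambiguity described above.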
If one wants to handle multilingual contractions/clitics, each language needs its own splitting rules.
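A tiny, hand-made subset of such rules, in the spirit of Penn-Treebank-style English contraction splitting and simple French elision (an assumption for illustration, not a complete rule set):

```python
import re

def split_clitics_en(token):
    # A few Penn-Treebank-style splits for English contractions.
    m = re.match(r"(?i)(.+)(n't)$", token)                    # don't -> do + n't
    if m:
        return [m.group(1), m.group(2)]
    m = re.match(r"(?i)(.+)('(?:s|re|ve|ll|d|m))$", token)    # we're -> we + 're
    if m:
        return [m.group(1), m.group(2)]
    return [token]

def split_clitics_fr(token):
    # Simple French elision: l'ami -> l' + ami
    m = re.match(r"(?i)([cdjlmnst]'|qu')(.+)$", token)
    if m:
        return [m.group(1), m.group(2)]
    return [token]

print(split_clitics_en("don't"))  # ['do', "n't"]
print(split_clitics_fr("l'ami"))  # ["l'", 'ami']
```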
Since tokenization and sentence segmentation go hand in hand, they share the same (cross-language) problems; for a general direction, the reference below is a good starting point.
References
(MANNI:1999) Manning, Ch. D. and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.