 

Why do I need a tokenizer for each language? [closed]

When processing text, why would one need a tokenizer specialized for the language?

Wouldn't tokenizing by whitespace be enough? In which cases is it not a good idea to use simple whitespace tokenization?

Asked Jun 26 '13 by Jack Twain


1 Answer

The question also implies "What is a word?", and the answer can be quite task-specific (even disregarding multilinguality as one parameter). Here is my attempt at a comprehensive answer:

(Missing) Spaces between words

Many languages do not put spaces between words at all, so the basic word-division algorithm of breaking on whitespace is of no use. Such languages include major East Asian languages/scripts, such as Chinese, Japanese, and Thai. Ancient Greek was also written without word spaces; spaces (together with accent marks, etc.) were only introduced later. In such languages, word segmentation is a much more major and challenging task. (MANNI:1999, p. 129)
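
As a minimal illustration (Python; the example sentences are mine, not from the book): whitespace splitting does a rough job on the English sentence but returns the unsegmented Chinese sentence as one single "token".

  # Whitespace tokenization vs. a language written without word spaces.
  english = "I bought a new phone yesterday"
  chinese = "我昨天买了一部新手机"   # same sentence in Chinese, no spaces between words

  print(english.split())   # ['I', 'bought', 'a', 'new', 'phone', 'yesterday']
  print(chinese.split())   # ['我昨天买了一部新手机']   (the whole sentence comes back as one token)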

Compounds

German compound nouns are written as a single word, e.g. "Kartellaufsichtsbehördenangestellter" (an employee at the "Anti-Trust agency"), and compounds are de facto single words, phonologically speaking (cf. (MANNI:1999, p. 120)). Their information density, however, is high, and one may wish to divide such a compound, or at least be aware of the internal structure of the word; this becomes a limited word segmentation task. (Ibidem)

There is also the special case of agglutinative languages, where prepositions, possessive pronouns, etc. are 'attached' to the 'main' word; in the European domain, e.g. Finnish, Hungarian, and Turkish.
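
To make the "limited word segmentation task" mentioned above concrete, here is a toy sketch of a greedy, dictionary-based decompounder in Python. The tiny lexicon, the longest-match strategy, and the handling of linking elements are my own simplifying assumptions; real decompounders use large lexicons, frequency models, and proper treatment of Fugenelemente.

  # Naive greedy decompounding sketch (illustrative only).
  LEXICON = {"kartell", "aufsicht", "behörde", "behörden", "angestellter"}  # toy dictionary
  LINKING = ("s", "n")  # common German linking elements (Fugenelemente)

  def split_compound(word, lexicon=LEXICON):
      word = word.lower()
      parts, i = [], 0
      while i < len(word):
          # longest dictionary entry starting at position i
          match = next((word[i:j] for j in range(len(word), i, -1)
                        if word[i:j] in lexicon), None)
          if match is None:
              return [word]  # give up: return the compound unsplit
          parts.append(match)
          i += len(match)
          # skip a linking element if no dictionary entry starts here
          if i < len(word) and word[i] in LINKING and not any(
                  word[i:j] in lexicon for j in range(len(word), i, -1)):
              i += 1
      return parts

  print(split_compound("Kartellaufsichtsbehördenangestellter"))
  # ['kartell', 'aufsicht', 'behörden', 'angestellter']  (the linking 's' is dropped)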

Variant styles and codings

Variant coding of information of a certain semantic type, e.g. local conventions for phone numbers, dates, etc.:

[...] Even if one is not dealing with multilingual text, any application dealing with text from different countries or written according to different stylistic conventions has to be prepared to deal with typographical differences. In particular, some items such as phone numbers are clearly of one semantic sort, but can appear in many formats. (MANNI:1999, p. 130)
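
One common way to keep such items together is a regex tokenizer whose "semantic" patterns (dates, phone numbers) are tried before the generic word pattern. The patterns below are rough illustrations I made up, covering only a couple of formats each, not a complete solution:

  import re

  # Ordered alternatives: dates and phone numbers first, then words, then single symbols.
  TOKEN_RE = re.compile(r"""
        \d{1,2}[./]\d{1,2}[./]\d{2,4}      # date like 26.06.2013 or 6/26/13
      | \+?\d[\d\s\-()]{6,}\d              # very rough phone number
      | \w+                                # plain words
      | [^\w\s]                            # any other single symbol
  """, re.VERBOSE)

  text = "Call +49 30 1234 5678 or 030-123-4567 before 26.06.2013, please."
  print(TOKEN_RE.findall(text))
  # ['Call', '+49 30 1234 5678', 'or', '030-123-4567', 'before', '26.06.2013', ',', 'please', '.']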

Misc.

One major task is the disambiguation of periods (and punctuation in general) and other non-alpha(numeric) symbols: if, for example, a period is part of the word, keep it that way, so that we can distinguish Wash., an abbreviation for the state of Washington, from the capitalized form of the verb wash (MANNI:1999, p. 129). Besides cases like this, the handling of contractions and hyphenation also cannot be treated as a standard case across languages (even disregarding the missing whitespace separator).
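
A minimal sketch of abbreviation-aware period handling; the abbreviation list is a toy stand-in (real tokenizers learn or ship far larger lists):

  import re

  # Keep the period attached when the token is a known abbreviation,
  # otherwise split trailing punctuation off as its own token.
  ABBREVIATIONS = {"Wash.", "Mr.", "Dr.", "etc."}   # toy list, illustrative only

  def tokenize(text):
      tokens = []
      for tok in text.split():
          if tok in ABBREVIATIONS:
              tokens.append(tok)                    # 'Wash.' stays one token
          else:
              tokens.extend(t for t in re.split(r"([.,!?])$", tok) if t)
      return tokens

  print(tokenize("Wash. is a state."))       # ['Wash.', 'is', 'a', 'state', '.']
  print(tokenize("Please wash. Then dry."))  # ['Please', 'wash', '.', 'Then', 'dry', '.']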

If one wants to handle multilingual contractions/clitics (a sketch for the English case follows this list):

  • English: They're my father's cousins.
  • French: Montrez-le à l'agent!
  • German: Ich hab's ins Haus gebracht. (in's is still a valid variant)
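
For the English case, a minimal sketch that splits clitics such as 're and 's off their host word (the clitic list is my assumption; French l' and German hab's would need their own rules):

  import re

  # Split English clitics off the host word: "They're" -> "They", "'re".
  CLITIC_RE = re.compile(r"(?<=\w)('re|'s|'ve|'ll|'d|n't)\b", re.IGNORECASE)

  def split_clitics(text):
      # insert a space before each clitic, then fall back to whitespace splitting
      return CLITIC_RE.sub(r" \1", text).split()

  print(split_clitics("They're my father's cousins."))
  # ['They', "'re", 'my', 'father', "'s", 'cousins.']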

Since tokenization and sentence segmentation go hand in hand, they share the same (cross-language) problems. For anyone who is concerned or wants a general direction (a minimal usage sketch follows these references):

  • Kiss, Tibor and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), p. 485-525.
  • Palmer, D. and M. Hearst. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics, 23(2), p. 241-267.
  • Reynar, J. and A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. Proceedings of the Fifth Conference on Applied Natural Language Processing, p. 16-19.
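
The Kiss & Strunk (2006) approach is, as far as I know, what NLTK ships as its pretrained Punkt model; a minimal usage sketch (exact outputs depend on the model version):

  import nltk
  from nltk.tokenize import sent_tokenize, word_tokenize

  nltk.download("punkt")   # pretrained Punkt model; newer NLTK releases may need "punkt_tab"

  text = "Mr. Smith went to Washington. He arrived on Jan. 5."
  print(sent_tokenize(text))
  # typically: ['Mr. Smith went to Washington.', 'He arrived on Jan. 5.']

  print(word_tokenize("They're my father's cousins."))
  # typically: ['They', "'re", 'my', 'father', "'s", 'cousins', '.']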

References

(MANNI:1999) Manning, Ch. D. and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.

Answered Sep 20 '22 by Nino