
Limitations of and alternatives to tries in languages other than English?

The trie data structure is often a great way to store strings in English. It works by building a tree where each edge is labeled with a letter, and the path to a marked node in the tree spells out one of the words in the data structure.

This data structure works well in English because there are "only" 26 letters in the English alphabet (a "reasonable" branching factor), those characters have consecutive ASCII values (so the child pointers can be stored in an array keyed by the index of the letters used by each child), and there are many English words with common prefixes (so there is a lot of redundancy in the structure).
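
For concreteness, a minimal sketch of that array-based layout might look like this (Java; the class and method names are just illustrative, and it assumes lowercase ASCII input only):

    class Trie {
        private static class Node {
            Node[] children = new Node[26]; // one slot per letter 'a'..'z'
            boolean isWord;                 // marks the end of a stored word
        }

        private final Node root = new Node();

        void insert(String word) {
            Node n = root;
            for (int i = 0; i < word.length(); i++) {
                int idx = word.charAt(i) - 'a';
                if (n.children[idx] == null) n.children[idx] = new Node();
                n = n.children[idx];
            }
            n.isWord = true;
        }

        boolean contains(String word) {
            Node n = root;
            for (int i = 0; i < word.length(); i++) {
                int idx = word.charAt(i) - 'a';
                if (n.children[idx] == null) return false;
                n = n.children[idx];
            }
            return n.isWord;
        }
    }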

I'm a native English speaker with only limited knowledge of other languages and alphabets, but it seems like many of these properties don't hold in other languages. I know that French, Spanish, German, and Hungarian, for example, frequently use accented characters that aren't stored contiguously with the remaining letters in Unicode space. Hebrew and Arabic have vowel markings that are usually indicated above or below each letter. Chinese uses a logogram system, and Korean Hangul is written with smaller letters (jamo) grouped together into syllable blocks.

Do tries still work well for data stored in these languages and alphabets? What changes, if any, are necessary to use tries for this sort of data? Are there data structures that are particularly well-suited to strings in those languages and alphabets but that would not be useful or efficient in English?

asked Dec 04 '14 by templatetypedef


2 Answers

I've found that tries work well for Western European languages, as well as for Cyrillic and many other alphabetic languages. Come to think of it, the only languages I had trouble with were Chinese, Japanese, and other logographic writing systems. And for those, the trie was useless.

The sequential Unicode values of English characters aren't really a huge benefit, although they do suggest the simple node implementation:

    class CharNode {
        char ch;                                 // letter on the edge into this node
        CharNode[] children = new CharNode[26];  // one slot per letter 'a'..'z'
    }

That structure isn't particularly helpful. It can make things faster, but at a fairly high memory cost. Even at the second level of a trie, that array is remarkably sparse. By the time you get to the fourth or fifth level, it's almost all dead space. I did an analysis of that at one point. I'll look around and see if I still have the numbers.

I found it nearly as fast to have a variable-length array in the node, with items ordered by frequency. Beyond the second or third level of the trie, the character I was looking for was almost always in the first or second position in that array. And the space savings was quite large. Rather than 26 references per node (104 bytes in my implementation), I had a one-byte count, and then five bytes per reference. So as long as there were fewer than 21 children for a particular node (which was most of the time), I saved space. There was a small runtime penalty, but not enough in my application to matter.
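
A rough sketch of that idea (Java; the names and representation are only illustrative, not the actual implementation described above, which packed a one-byte count and five-byte references):

    // Node with a variable-length, frequency-ordered child list.
    // Linear search stays fast because hot children drift toward the front.
    class ListNode {
        char[] keys = new char[0];        // characters labelling the child edges
        ListNode[] kids = new ListNode[0];
        int[] hits = new int[0];          // how often each child has been traversed
        boolean isWord;                   // marks the end of a stored word

        ListNode child(char c, boolean createIfMissing) {
            for (int i = 0; i < keys.length; i++) {
                if (keys[i] == c) {
                    hits[i]++;
                    if (i > 0 && hits[i] > hits[i - 1]) {
                        swap(i, i - 1);   // move one slot forward when hotter than the neighbour
                        i--;
                    }
                    return kids[i];
                }
            }
            if (!createIfMissing) return null;
            int n = keys.length;          // append a new child at the end
            keys = java.util.Arrays.copyOf(keys, n + 1);
            kids = java.util.Arrays.copyOf(kids, n + 1);
            hits = java.util.Arrays.copyOf(hits, n + 1);
            keys[n] = c;
            kids[n] = new ListNode();
            return kids[n];
        }

        private void swap(int i, int j) {
            char ck = keys[i]; keys[i] = keys[j]; keys[j] = ck;
            ListNode nn = kids[i]; kids[i] = kids[j]; kids[j] = nn;
            int hh = hits[i]; hits[i] = hits[j]; hits[j] = hh;
        }
    }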

That's the only modification I had to make to my trie structure to make it support all of the alphabetic languages I was working with. As I said, I was working mostly with Western European languages, and for those it worked beautifully. I know that it did work with Hebrew and Arabic, but I don't know how well it worked. It met our purposes, but whether it would have satisfied a native speaker is unknown.

The trie I built worked well enough for our purposes with any language whose characters fit within the Unicode Basic Multilingual Plane. There was a little weirdness when working with surrogate pairs, but we pretty much ignored those. Basically, we just treated the surrogate pair as two characters and let it go at that.
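
In Java terms, that happens naturally if the trie keys on UTF-16 code units (char) rather than code points: a supplementary character simply contributes two edges. A small illustration, with made-up example strings:

    class SurrogateDemo {
        public static void main(String[] args) {
            String bmp = "trie";              // every character is one UTF-16 code unit
            String astral = "a\uD83D\uDE00b"; // 'a', a surrogate pair (U+1F600), 'b'

            System.out.println(bmp.length());                               // 4
            System.out.println(astral.length());                            // 4 code units
            System.out.println(astral.codePointCount(0, astral.length()));  // 3 code points

            // Walking the string with charAt() visits code units, so a trie keyed
            // on char sees the surrogate pair as two consecutive edges -- the
            // "treat it as two characters" approach described above.
        }
    }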

You do have to decide whether you want to treat accented characters as separate characters, or whether you want to map them. Consider, for example, the French word "garçon," which some people will spell "garcon," either because they don't know any better or because they don't know how to type the character 'ç'. Depending on what you're using the trie for, you might find it useful to convert accented characters to their un-accented equivalents. But I suppose that's more of an input cleansing problem than a trie problem.
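
If you do fold accents away, one sketch of that cleansing step using java.text.Normalizer (whether this is appropriate depends on the application; see the cote/coté/côte/côté caveat in the next answer):

    import java.text.Normalizer;

    class AccentFolder {
        // Decompose to NFD, then drop combining marks (Unicode category M).
        static String fold(String s) {
            return Normalizer.normalize(s, Normalizer.Form.NFD)
                             .replaceAll("\\p{M}+", "");
        }

        public static void main(String[] args) {
            System.out.println(fold("garçon")); // prints "garcon"
        }
    }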

That's my fairly long-winded way of saying that a standard trie should work well for any alphabetic language, without any language-specific modifications. I don't see any obvious way to use a trie for a logographic language. I know nothing about Korean Hangul, so I can't say whether a trie would be useful there.

answered by Jim Mischel


As an addendum to @JimMischel's answer, I'd like to bring up the issue that in other languages there are often multiple equivalent ways to write the same thing. Vietnamese (based on the Latin/English script) is a particularly good example where letters with two accents are common. For example, Ặ (U+1EB6) can technically also be written with the sequences Ă + dot, Ạ + breve, A + breve + dot, A + dot + breve.

Unicode normalization can solve this problem by converting a string to a standardized canonical order. There are four different variants: NFC, NFKC, NFD, and NFKD. I won't go into too much detail here, but the first two are "composed forms," which tend to shorten the string by grouping base characters with their accents, while the last two are "decomposed forms," which do the opposite.
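
As an illustration with java.text.Normalizer (a sketch; the point is just that the equivalent spellings converge on a single form, so a trie sees one key):

    import java.text.Normalizer;

    class NormalizeDemo {
        public static void main(String[] args) {
            // A + combining breve (U+0306) + combining dot below (U+0323),
            // one of the equivalent spellings of Ặ mentioned above
            String s = "A\u0306\u0323";

            String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
            String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);

            System.out.println(nfc.length()); // 1 -- composed to Ặ (U+1EB6)
            System.out.println(nfd.length()); // 3 -- A, dot below, breve, in canonical order

            // Normalizing every key to the same form before insert/lookup makes
            // all of the equivalent spellings follow the same path in the trie.
        }
    }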

Hangul is an interesting case: it is an alphabet, though all the letters of a syllable are written together in a block, and both the individual letters and the syllabic blocks exist in Unicode. Normalization can handle this too, but the number of distinct precomposed syllables is quite large, so NFC/NFKC might not be useful for a trie; in this case, using NFD/NFKD to decompose syllables into their constituent letters would work.
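
A quick sketch of that, again with java.text.Normalizer (precomposed Hangul syllable blocks decompose algorithmically under NFD):

    import java.text.Normalizer;

    class HangulDemo {
        public static void main(String[] args) {
            String syllable = "\uD55C";   // 한, a single precomposed syllable block (U+D55C)

            String jamo = Normalizer.normalize(syllable, Normalizer.Form.NFD);

            System.out.println(syllable.length()); // 1
            System.out.println(jamo.length());     // 3 -- leading ㅎ, vowel ㅏ, trailing ㄴ

            // Keying the trie on the NFD form means branching over a few dozen
            // jamo instead of the ~11,000 precomposed syllables.
        }
    }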

A few other unrelated points to consider:

  • In addition to the garçon/garcon point already brought up, there is the cote/coté/côte/côté problem: those are four distinct French words, so folding accents away conflates them. Similarly, vowel marks in Hebrew and Arabic are not usually mandatory, which can occasionally cause ambiguities.
  • The alphabets¹ of South and Southeast Asia can get large compared to English, roughly twice the size.

  ¹ They are strictly termed abugidas, where vowels are written as diacritics/accents, but this distinction can usually be ignored from a programming point of view.
answered by DPenner1