The trie data structure is often a great way to store strings in English. It works by building a tree where each edge is labeled with a letter, and the path to a marked node in the tree spells out one of the words in the data structure.
This data structure works well in English because there are "only" 26 letters in the English alphabet (a "reasonable" branching factor), those characters have consecutive ASCII values (so the child pointers can be stored in an array keyed by the index of the letters used by each child), and there are many English words with common prefixes (so there is a lot of redundancy in the structure).
I'm a native English speaker with only limited knowledge of other languages and alphabets, but it seems like many of these properties don't hold in other languages. I know that French, Spanish, German, and Hungarian, for example, frequently use accented characters that aren't stored contiguously with the unaccented letters in Unicode space. Hebrew and Arabic have vowel markings that are usually indicated above or below each letter. Chinese uses a logogram system, and Korean Hangul characters are built from two or three smaller letters grouped into syllable blocks.
Do tries still work well for data stored in these languages and alphabets? What changes, if any, are necessary to use tries for this sort of data? Are there any data structures that work well for strings in those languages and alphabets that are particularly well-suited to them but would not be useful or efficient in English?
I've found that tries work well for Western European languages, as well as for Cyrillic and many other alphabetic languages. Come to think of it, the only languages I had trouble with were Chinese, Japanese, and other logographic writing systems. And for those, the trie was useless.
The sequential Unicode values of English characters aren't really a huge benefit, although they do suggest a simple node implementation (sketched here in Java):
    class CharNode {
        char ch;                                // the letter on the incoming edge
        CharNode[] children = new CharNode[26]; // indexed by (letter - 'a')
    }
That structure isn't particularly helpful. It can make things faster, but at a fairly high memory cost. Even at the second level of a trie, that array is remarkably sparse. By the time you get to the fourth or fifth level, it's almost all dead space. I did an analysis of that at one point. I'll look around and see if I still have the numbers.
I found it nearly as fast to have a variable-length array in the node, with items ordered by frequency. Beyond the second or third level of the trie, the character I was looking for was almost always in the first or second position in that array. And the space savings were quite large. Rather than 26 references per node (104 bytes in my implementation), I had a one-byte count, and then five bytes per reference. So as long as there were fewer than 21 children for a particular node (which was most of the time), I saved space. There was a small runtime penalty, but not enough in my application to matter.
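Something like this sketch, in Java (names made up; it uses a move-to-front style swap as a stand-in for the static frequency ordering I described):

    final class CompactNode {
        boolean isWord;                          // path to this node spells a word
        char[] labels = new char[0];             // child edge labels, hottest first
        CompactNode[] children = new CompactNode[0];

        CompactNode child(char c, boolean create) {
            for (int i = 0; i < labels.length; i++) {
                if (labels[i] != c) continue;
                if (i > 0) {                     // bubble the hit one slot forward
                    char tl = labels[i - 1]; labels[i - 1] = labels[i]; labels[i] = tl;
                    CompactNode tn = children[i - 1]; children[i - 1] = children[i]; children[i] = tn;
                    return children[i - 1];
                }
                return children[i];
            }
            if (!create) return null;            // lookup miss
            labels = java.util.Arrays.copyOf(labels, labels.length + 1);
            children = java.util.Arrays.copyOf(children, children.length + 1);
            labels[labels.length - 1] = c;
            return children[children.length - 1] = new CompactNode();
        }
    }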
That's the only modification I had to make to my trie structure to make it support all of the alphabetic languages I was working with. As I said, I was working mostly with Western European languages, and for those it worked beautifully. I know that it did work with Hebrew and Arabic, but I don't know how well it worked. It met our purposes, but whether it would have satisfied a native speaker is unknown.
The trie I built worked well enough for our purposes with any language whose characters fit within the Unicode Basic Multilingual Plane. There was a little weirdness when working with surrogate pairs, but we pretty much ignored those. Basically, we just treated the surrogate pair as two characters and let it go at that.
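To make that concrete, here's a sketch of the difference between walking a string by UTF-16 code units (what we did) and walking it by code points, in Java; insert is a stand-in for whatever trie-insertion routine you use:

    String word = "x\uD83D\uDE00";               // "x" + U+1F600, which is outside the BMP

    for (int i = 0; i < word.length(); i++) {
        char unit = word.charAt(i);              // a non-BMP character appears as two
        // insert(unit);                         // code units, i.e. two trie edges
    }

    word.codePoints().forEach(cp -> {            // each character is a single int key
        // insert(cp);                           // in 0..0x10FFFF, one trie edge
    });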
You do have to decide whether you want to treat accented characters as separate characters, or if you want to map them. Consider, for example, the French word "garçon," which some people will spell "garcon," either because they don't know any better or they don't know how to make the character 'ç'. Depending on what you're using the trie for, you might find it useful to convert accented characters to their un-accented equivalents. But I suppose that's more of an input cleansing problem than a trie problem.
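If you do decide to map them, one common approach (a sketch, not the only way) is to decompose with NFD and then strip the combining marks, using Java's java.text.Normalizer:

    import java.text.Normalizer;

    static String stripAccents(String s) {
        // Decompose, then drop combining marks (Unicode general category Mn).
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{Mn}+", "");
    }
    // stripAccents("garçon") -> "garcon"   (ç decomposes to c + U+0327)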
That's my fairly long-winded way of saying that a standard trie should work well for any alphabetic language, without any language-specific modifications. I don't see any obvious way to use a trie for a logographic language. I know nothing about Korean Hangul, so I can't say whether a trie would be useful there.
As an addendum to @JimMischel's answer, I'd like to bring up the issue that in other languages there are often multiple equivalent ways to write the same thing. Vietnamese (written in a Latin-based script) is a particularly good example, since letters carrying two diacritics are common. For example, Ặ (U+1EB6) can technically also be written as the sequences Ă + dot, Ạ + breve, A + breve + dot, or A + dot + breve.
Unicode normalization can solve this problem by converting each string to a standardized canonical form. There are four variants: NFC, NFKC, NFD, and NFKD. I won't go into too much detail here, but the first two are "composed" forms, which tend to shorten the string by combining base characters with their accents, while the last two are "decomposed" forms, which do the opposite.
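As a quick illustration with Java's java.text.Normalizer: two of the equivalent spellings of Ặ compare unequal as raw strings, but identical once both are normalized:

    import java.text.Normalizer;

    String precomposed = "\u1EB6";               // Ặ as a single code point
    String combining = "A\u0323\u0306";          // A + combining dot below + combining breve

    boolean rawEqual = precomposed.equals(combining);           // false
    boolean nfcEqual = Normalizer.normalize(precomposed, Normalizer.Form.NFC)
            .equals(Normalizer.normalize(combining, Normalizer.Form.NFC));  // true

Keying the trie on normalized strings makes all of those spellings collide, which is exactly what you want here.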
Hangul is an interesting case: it is an alphabet, but the letters of each syllable are written together in a block, and both the individual letters (jamo) and the precomposed syllable blocks exist in Unicode. Normalization can reconcile the two, though the number of distinct syllables is quite large (Unicode defines 11,172 precomposed Hangul syllables). Using NFC/NFKC might not be useful for a trie because of that huge branching factor, but using NFD/NFKD to decompose syllables into their constituent letters works well.
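A small sketch of that, again with java.text.Normalizer:

    String syllable = "\uD55C";                  // 한 ("han") as one precomposed code point
    String jamo = Normalizer.normalize(syllable, Normalizer.Form.NFD);
    // jamo is "\u1112\u1161\u11AB" (ᄒ + ᅡ + ᆫ): three letters, three trie edges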