Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C++ Most efficient way for storing, loading and looking up a lexicon

I have a dictionary that consists of words and their phonetic transcriptions. The words are all lower case, so there is not case-sensitive search involved.

The lexicon is really huge, and I need to load it quickly when my application starts. I would prefer reading it without having to read each entry separately.

I guess the way I stored and load it also affects how I keep the lexicon in memory

Thank you for any ideas.

like image 305
tmighty Avatar asked May 21 '13 15:05

tmighty


People also ask

What is data dictionary in C language?

A data dictionary is a collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them.

How would you implement a dictionary?

A Dictionary (also known as Table or Map) can be implemented in various ways: using a list, binary search tree, hashtable, etc., etc.

Does C have a dictionary?

Section 6.6 of The C Programming Language presents a simple dictionary (hashtable) data structure. I don't think a useful dictionary implementation could get any simpler than this. For your convenience, I reproduce the code here. Note that if the hashes of two strings collide, it may lead to an O(n) lookup time.

Do we have dictionary in C++?

The dictionary type that is present in C++ is called map which acts like a container to store values that are indexed by keys. Each value in the dictionary also called a map is associated with a key.


2 Answers

You probably want to store this as a Trie

This is an efficient way of storing a dictionary. Look at the following answers for more information

http://en.wikipedia.org/wiki/Trie

https://stackoverflow.com/questions/296618/what-is-the-most-common-use-of-the-trie-data-structure

Persisting a trie to a file - C

like image 71
Salgar Avatar answered Sep 21 '22 13:09

Salgar


A few options come to mind:

  1. You could use sqlite, which uses mmap to map the file to memory, to store the lexicon so only what is accessed gets read. This is probably reasonable fast and reliable as well as the easiest to implement.
  2. You can mmap the file yourself
  3. Use seek operations to move the file pointer through the file without reading the whole thing. This will only help if the lexicon is structured in some way so you can find the right position without reading everything, i.e. it has to be a data structure that allows better than O(n) searching (a Trie usually being a good choice, as suggested by Salgar).
like image 20
smocking Avatar answered Sep 23 '22 13:09

smocking