I am trying to implement a very, very large dictionary search to match words in a sentence in PHP. My initial idea was to use the Aho-corasick algorithm, as the Aho-corasick solves my exact problem. I first implemented a Trie in PHP. The Trie, when cached, makes a sufficiently fast dictionary; however, it takes up approximately 3mb of memory. This is not going to scale well in PHP.
Obviously, no matter the data structure we use, a large dictionary will take up a lot of memory. I only need a single instance of the dictionary, since it is static and will not need to be rebuilt.
If this object could be shared between all threads, 3mb of memory is negligible, however, I am unsure of the proper way to share memory between threads in PHP.
How can I share this object between HTTP requests? I cannot see the project scaling when each thread requires 3mb overhead created just by the Trie.
I wrote (forked from APC and maintain) APCu: A shared memory cache is not going to help you. Their internal storage area already has a defined structure, you can't change it. You can store your structure as objects, but these, and no other value, are actually shared between instances of PHP. Shared memory apc-like caches, copy out of shared memory for each context that requests the value.
I wrote pthreads (PHP extension): Threads are not going to help you. Just like APC having to copy out of shared memory, threads must.
PHP is shared nothing, all the time, or else you break stuff. You could write code that seemed as if it were sharing memory, but it wouldn't be; The rules must never be broken.
I don't think PHP a sensible target language if a primary requirement is efficiency, you seemed to recognize this by the end of your first paragraph. I can be wrong about that, but armed with all the facts above, I'd be surprised if you don't agree.
While it's not a sensible language, it's an arguably sensible platform. I'm going to assume that you want to use this in a web application context, and so are targeting PHP, but a much more sensible thing to do would be to implement the structures and algorithms in a suitable language, and expose it to your web application via an extension.
Suitable language generally means C or C++ for a PHP extension, but can mean others, if you are inventive enough.
You would still not be able to break the rules, but you wouldn't need to.
Obviously this relies on your ability to do those things.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With