I have written a program that highlights all instances of a given word class in a text. This is how it works:
Make an array of words from the entire text
Iterate over this array. For each word, check what its first letter is.
After all words are checked, iterate the array of matches and highlight each one in the text.
On my machine, a text of 240,000 words takes about 100 seconds to process for nouns and about 4.5 seconds for prepositions.
I am looking for a way to improve performance, and these are the ideas I could come up with:
Are these solid ideas, and are there other ideas or proven techniques for speeding up this kind of processing?
Use the power of JavaScript.
It manipulates dictionaries with string keys as a fundamental operation. For each word class, build an object with each possible word as a key and some simple value like true or 1. Checking each word is then simply typeof(wordClass[word]) !== "undefined". I expect this to be much, much faster.
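A minimal sketch of that lookup, assuming a tiny made-up preposition list (prepositionList, findMatches, and the sample sentence are hypothetical names and data, not taken from the question). The object is created with Object.create(null) so that inherited property names such as "constructor" are not mistaken for dictionary words:

```javascript
// Hypothetical word list; a real one would come from your dictionary data.
const prepositionList = ["in", "on", "under", "over", "with"];

// Build the lookup object once, with no prototype, so only our own keys exist.
const prepositions = Object.create(null);
for (const w of prepositionList) {
  prepositions[w] = true;
}

// Return the indexes of all words that belong to the given word class.
function findMatches(words, wordClass) {
  const matches = [];
  for (let i = 0; i < words.length; i++) {
    // One property lookup per word, independent of the size of the word list.
    if (typeof wordClass[words[i].toLowerCase()] !== "undefined") {
      matches.push(i);
    }
  }
  return matches;
}

const words = "The cat sat on the mat under the constructor".split(/\s+/);
const hits = findMatches(words, prepositions);
console.log(hits);
```

Each check is a single property lookup, so the total cost should grow with the number of words in the text rather than with the size of the dictionary.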
Regular expressions are another highly optimized area of JavaScript. You can probably do the whole thing as one massive regular expression for each word class. If your highlighting is in HTML, you can also just use a replace with the RE to get the result. Whether this works likely depends on how big your word sets are.
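A rough sketch of the regular expression variant, assuming a small made-up noun list and a hypothetical "hl" CSS class for the highlight (nounList, escapeRegExp, and highlight are illustrative names). It also assumes the input is plain text rather than existing HTML, and note that \b only recognizes ASCII word characters:

```javascript
// Hypothetical word list; a real one would be far larger.
const nounList = ["cat", "mat", "category"];

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// One alternation per word class, anchored on word boundaries,
// global and case-insensitive.
const pattern = new RegExp(
  "\\b(" + nounList.map(escapeRegExp).join("|") + ")\\b",
  "gi"
);

// Wrap every match in a span; $1 is the matched word with its original casing.
function highlight(text, re) {
  return text.replace(re, '<span class="hl">$1</span>');
}

const html = highlight("The cat sat on the mat.", pattern);
console.log(html);
```

The pattern is built once per word class and reused, so the per-text cost is a single replace call. With a very large word list it is worth measuring this against the object lookup before committing to it.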