Increasing performance on text processing

I have written a program that marks all instances of a chosen word class in a text. This is how I do it:

  • Make an array of words from the entire text

  • Iterate over this array. For each word, look at its first letter.

    • Jump to the corresponding array in an object that holds all words of the selected word class (e.g. 'S') and iterate over it. If the word is found, push it into an array of matches and break.
  • After all words are checked, iterate over the array of matches and highlight each one in the text.

On my machine, a text of 240,000 words takes 100 seconds to process for nouns and about 4.5 seconds for prepositions.

I am looking for a way to improve performance, and these are the ideas I have come up with:

  • Rearrange the items in each block of my word list. Sort them so that, if a word starts with a vowel, all items whose second character is a consonant come first, and vice versa (on the assumption that words with double vowels or double consonants are rare).
  • Structure the text into chapters and process only the currently shown chapter.

Are those solid ideas, and are there any other ideas or proven techniques to improve this kind of processing?

Wottensprels asked Mar 20 '15 15:03


1 Answer

Use the power of JavaScript.

It handles dictionaries with string keys as a fundamental operation. For each word class, build an object in which each possible word is a key with a simple value such as true or 1. Checking a word is then simply typeof(wordClass[word]) !== "undefined". I expect this to be much, much faster.
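A minimal sketch of that lookup idea follows. The word list, the function names and the tokenizing regular expression are illustrative assumptions, not part of the original program. The lookup object is created with Object.create(null) so that inherited keys such as "constructor" are not reported as matches, which a plain {} would do with the typeof check.

```javascript
// Hypothetical word list for one word class; a real list would be much larger.
const prepositions = ["in", "on", "under", "over", "with"];

// Build the lookup object once per word class.
// Object.create(null) gives an object without inherited properties.
function buildLookup(words) {
  const lookup = Object.create(null);
  for (const w of words) {
    lookup[w.toLowerCase()] = true;
  }
  return lookup;
}

// Return every word of the text that is in the lookup, in order of appearance.
// Assumption: words are runs of Unicode letters; everything else is a separator.
function findMatches(text, lookup) {
  const matches = [];
  for (const word of text.split(/[^\p{L}]+/u)) {
    if (word && lookup[word.toLowerCase()] === true) {
      matches.push(word);
    }
  }
  return matches;
}

const prepositionLookup = buildLookup(prepositions);
const found = findMatches(
  "The cat sat on the mat, under the constructor.",
  prepositionLookup
);
console.log(found);
```

Each word now costs one property lookup instead of a scan through a per-letter array, so the total work grows with the length of the text rather than with text length times word-list size. A Set with has() would serve the same purpose.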

Regular expressions are another highly optimized area of JavaScript. You can probably do the whole thing as one large regular expression per word class. If your highlighting is in HTML, you can also just run a replace with that regular expression to get the result. Whether this works likely depends on how big your word sets are.
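The regular-expression route could look roughly like the sketch below. The helper names, the sample word list and the choice of a mark element for highlighting are assumptions made for illustration. Note that \b only treats ASCII letters, digits and underscore as word characters, so a text with umlauts or other non-ASCII letters would need a different boundary test. The input is also assumed to be already HTML-escaped.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one case-insensitive, global alternation for a whole word class.
function buildClassRegExp(words) {
  const alternation = words.map(escapeRegExp).join("|");
  return new RegExp("\\b(?:" + alternation + ")\\b", "gi");
}

// Wrap every match in a <mark> element; "$&" inserts the matched text.
function highlight(text, re) {
  return text.replace(re, "<mark>$&</mark>");
}

// Hypothetical word class for demonstration.
const classRegExp = buildClassRegExp(["on", "under"]);
const highlighted = highlight("Look under the table, not on it.", classRegExp);
console.log(highlighted);
```

This finds and highlights in a single pass, so the separate array of matches is no longer needed. With word lists of many thousands of entries the pattern becomes very long, so it is worth timing it against the lookup-object approach.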

DrC answered Oct 07 '22 21:10