Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast Text Preprocessing

In my project I work with text in general. I found that preprocessing can be very slow. So I would like to ask you if you know how to optimize my code. The flow is like this:

get HTML page -> (To plain text -> stemming -> remove stop words) -> further text processing

In brackets there are preprocessing steps. The application runs in about 10.265s, but preprocessing takes 9.18s! This is time for preprocessing 50 HTML pages (excluding downloading).

I use HtmlAgilityPack library to convert HTML into plain text. This is quite fast. It takes 2.5ms to convert 1 document, so it's relatively OK.

Problem comes later. Stemming one document takes up to 120ms. Unfortunately those HTML pages are in Polish. There no exists stemmer for Polish language written in C#. I know only 2 free to use written in Java: stempel and morfologic. I precompiled stempel.jar to stempel.dll with help of IKVM software. So there is nothing more to do.

Eliminating stop words takes also a lot of time (~70ms for 1 doc). It is done like this:


result = Regex.Replace(text.ToLower(), @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
while (stopwords.MoveNext())
{
   string stopword = stopwords.Current.ToString();                
   result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");                               
}
return result;

First i remove all numbers, special characters, 1- and 2-letter words. Then in loop I remove stop words. There are about 270 stop words.

Is it possible to make it faster?

Edit:

What I want to do is remove everything which is not a word longer than 2 letters. So I want to get out all special chars (including '.', ',', '?', '!', etc.) numbers, stop words. I need only pure words that I can use for Data Mining.

like image 427
Ventus Avatar asked Jul 29 '10 17:07

Ventus


3 Answers

Iteratively replacing words is going to be the biggest bottleneck in your implementation. On each iteration, you have to scan the entire string for the stopword, then the replace operation has to allocate a new string and populate it with the text post-replacement. That's not going to be fast.

A much more efficient approach is to tokenize the string and perform replacement in a streaming fashion. Divide the input into individual words separated by whatever whitespace or separator characters are appropriate. You can do this incrementally so you don't need to allocate any additional memory to do so. For each word (token), you can now perform a lookup in a hashset of stopwords - if you find match, you will replace it as you stream out the final text to a separate StringBuilder. If the token is not a stopword, just stream it out the StringBuilder unmodified. This approach should have O(n) performance, as it only scans the string once and uses a HashSet to perform stopword lookup.

Below is one approach that I would expect to perform better. While it isn't fully streaming (it uses String.Split() which allocated an array of additional strings), it does all of the processing in a single pass. Refining the code to avoid allocating additional string is probably not going to provide much of an improvement, since you still need to extract out substrings to perform comparisons to your stopwords.

Code below returns a list of words that excludes all stopwords and words two letters or shorter form the result. It uses case-insensitive comparison on the stopwords as well.

public IEnumerable<string> SplitIntoWords( string input,
                                           IEnumerable<string> stopwords )
{
    // use case-insensitive comparison when matching stopwords
    var comparer = StringComparer.InvariantCultureIgnoreCase;
    var stopwordsSet = new HashSet<string>( stopwords, comparer );
    var splitOn = new char[] { ' ', '\t', '\r' ,'\n' };

    // if your splitting is more complicated, you could use RegEx instead...
    // if this becomes a bottleneck, you could use loop over the string using
    // string.IndexOf() - but you would still need to allocate an extra string
    // to perform comparison, so it's unclear if that would be better or not
    var words = input.Split( splitOn, StringSplitOptions.RemoveEmptyEntries );

    // return all words longer than 2 letters that are not stopwords...
    return words.Where( w => !stopwordsSet.Contains( w ) && w.Length > 2 );
}
like image 66
LBushkin Avatar answered Nov 15 '22 17:11

LBushkin


Instead of having the regex replace in the loop, why not dynamically construct a monster matching regex that matches any one of your stopwords, and then run one replace, replacing it with nothing? Something like "\b(what|ok|yeah)\b" if your stopwords are "what", "ok", and "yeah". That seems like it would probably be more efficient.

like image 2
mqp Avatar answered Nov 15 '22 18:11

mqp


OK, I know that SO is not a pure forum and maybe I shouldn't answer my own question but I'd like to share with my results.

Finally, thanks to you guys, I managed to get better optimization of my text preprocessing. First of all I made simpler that long expression from my question (following Josh Kelley's answer):

[0-9]|[^\w]|(\b\w{1,2}\b)

It does the same as first one but is very simple. Then following Josh Kelley's suggestion again I put this regex into assembly. Great example of compiling expressions into assembly I found here. I did that, because this regex is used many, many times. After lecture of few articles about compiled regex, that was my decision. I removed the last expression after eliminating stop words (no real sense with that).

So the execution time on 12KiB text file was ~15ms. This is only for expression mentioned above.

Last step were stop words. I decided to make a test for 3 different options (Execution times are for the same 12KiB text file).

One big Regular Expression

with all stop words and compiled into assembly (mquander's suggestion). Nothing to clear here.

  • Execution time: ~215ms

String.Replace()

People say that this can be faster than Regex. So for each stop word I used string.Replace() method. Many loops to take with result:

  • Execution time: ~65ms

LINQ

method presented by LBushkin. Nothing to say more.

  • Execution time: ~2.5ms

I can only say wow. Just compare execution times of first one with the last one! Big thanks LBushkin!

like image 2
Ventus Avatar answered Nov 15 '22 19:11

Ventus