sklearn: Would like to extend CountVectorizer to fuzzy match against vocabulary

I was going to try using fuzzywuzzy with a tuned acceptable-score parameter. Essentially, it would check whether the word is in the vocabulary as-is; if not, it would ask fuzzywuzzy to choose the best fuzzy match and accept that into the list of tokens if it scored at least the threshold.

If this isn't the best approach to deal with a fair amount of typos and slightly differently spelled but similar words, I'm open to suggestions.
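Roughly what I had in mind for the lookup (an untested sketch; repair_token is just an illustrative name, and I'm assuming fuzzywuzzy's process.extractOne):

from fuzzywuzzy import process

def repair_token(token, vocabulary, min_fuzzy_score=80):
    # accept the token as-is if it's already in the vocabulary
    if token in vocabulary:
        return token
    # otherwise take the best fuzzy match, if it scores high enough
    match = process.extractOne(token, vocabulary)
    if match is not None and match[1] >= min_fuzzy_score:
        return match[0]
    return token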

The problem is that the subclass keeps complaining that it has an empty vocabulary, which doesn't make sense, since a regular CountVectorizer works fine on the same data in the same part of my code.

It spits out many errors like this:

ValueError: empty vocabulary; perhaps the documents only contain stop words

What am I missing? I don't have it doing anything special yet. It ought to work like normal:

import numpy
from typing import List
from sklearn.feature_extraction.text import CountVectorizer


class FuzzyCountVectorizer(CountVectorizer):
    def __init__(self, input='content', encoding='utf-8', decode_error='strict',
                 strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None,
                 token_pattern="(?u)\b\w\w+\b", ngram_range=(1, 1), analyzer='word',
                 max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False,
                 dtype=numpy.int64, min_fuzzy_score=80):
        super().__init__(
            input=input, encoding=encoding, decode_error=decode_error, strip_accents=strip_accents,
            lowercase=lowercase, preprocessor=preprocessor, tokenizer=tokenizer, stop_words=stop_words,
            token_pattern=token_pattern, ngram_range=ngram_range, analyzer=analyzer, max_df=max_df,
            min_df=min_df, max_features=max_features, vocabulary=vocabulary, binary=binary, dtype=dtype)
        # self._trained = False
        self.min_fuzzy_score = min_fuzzy_score

    @staticmethod
    def remove_non_alphanumeric_chars(s: str) -> str:
        pass

    @staticmethod
    def tokenize_text(s: str) -> List[str]:
        pass

    def fuzzy_repair(self, sl: List[str]) -> List[str]:
        pass

    def fit(self, raw_documents, y=None):
        print('Running FuzzyTokenizer Fit')
        #TODO clean up input
        super().fit(raw_documents=raw_documents, y=y)
        self._trained = True
        return self

    def transform(self, raw_documents):
        print('Running Transform')
        #TODO clean up input
        #TODO fuzzyrepair
        return super().transform(raw_documents=raw_documents)
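For reference, a minimal reproduction along the lines of what I'm doing (simplified); the regular CountVectorizer succeeds on exactly the same input:

fuzzy_vec = FuzzyCountVectorizer()
fuzzy_vec.fit(["this text has some words in it"])   # raises the ValueError above

plain_vec = CountVectorizer()
plain_vec.fit(["this text has some words in it"])   # works fine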
asked Oct 18 '22 by KotoroShinoto
1 Answer

The original function definition for scikit-learn's CountVectorizer has

token_pattern=r"(?u)\b\w\w+\b"

while in your subclass you omit the raw-string prefix r, hence the issue: without it, Python interprets \b as the backspace character rather than a regex word boundary, so the token pattern never matches anything and the vocabulary ends up empty. Also, instead of copying all the __init__ arguments, it might be easier to just use,

def __init__(self, *args, **kwargs):
    # pull the custom argument out before delegating to CountVectorizer
    self.min_fuzzy_score = kwargs.pop('min_fuzzy_score', 80)
    super().__init__(*args, **kwargs)
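To see why the missing prefix matters, a quick check you can run in a plain Python shell:

import re

broken = "(?u)\b\w\w+\b"    # "\b" here is the backspace character \x08
fixed = r"(?u)\b\w\w+\b"    # r"\b" is the regex word-boundary assertion

print(re.findall(broken, "some sample text"))   # [] -> empty vocabulary
print(re.findall(fixed, "some sample text"))    # ['some', 'sample', 'text']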

As to whether this is the best approach, it depends on the size of your dataset. For a document set with a total of N_words tokens and a vocabulary of size N_vocab_size, this approach would require O(N_words * N_vocab_size) fuzzy word comparisons. Whereas if you vectorized your dataset with the standard CountVectorizer and then reduced the computed vocabulary (and the bag-of-words matrix) by fuzzy matching, it would require "only" O(N_vocab_size**2) comparisons.
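A rough sketch of that second approach, again assuming fuzzywuzzy; merge_fuzzy_vocab is an illustrative helper (not part of scikit-learn), and the greedy first-match grouping is just one possible strategy:

import numpy as np
from fuzzywuzzy import fuzz
from sklearn.feature_extraction.text import CountVectorizer

def merge_fuzzy_vocab(X, vocab, min_score=80):
    # vocab maps term -> column index; invert it to get column-ordered terms
    terms = sorted(vocab, key=vocab.get)
    canonical = {}  # term -> representative term for its fuzzy group
    for term in terms:
        for rep in list(canonical.values()):
            if fuzz.ratio(term, rep) >= min_score:
                canonical[term] = rep
                break
        else:
            canonical[term] = term
    merged = sorted(set(canonical.values()))
    col = {t: i for i, t in enumerate(merged)}
    # sum the columns of all terms that share a representative
    X = X.tocsc()
    out = np.zeros((X.shape[0], len(merged)), dtype=X.dtype)
    for j, term in enumerate(terms):
        out[:, col[canonical[term]]] += X[:, j].toarray().ravel()
    return out, merged

vec = CountVectorizer()
X = vec.fit_transform(["color colour", "gray grey stone"])
X_merged, merged_vocab = merge_fuzzy_vocab(X, vec.vocabulary_)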

This would probably still not scale well for a vocabulary beyond a few tens of thousands of words. If you intend to apply a machine learning algorithm to the resulting sparse array, you might also want to try character n-grams, which are somewhat robust to typographical errors.
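For instance, with scikit-learn's built-in character n-gram analyzer:

from sklearn.feature_extraction.text import CountVectorizer

# 'char_wb' builds character n-grams from text inside word boundaries,
# so "color" and "colour" share most of their 3- to 5-grams
vec = CountVectorizer(analyzer='char_wb', ngram_range=(3, 5))
X = vec.fit_transform(["color", "colour"])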

answered Oct 21 '22 by rth