I was going to try using fuzzywuzzy with a tuned acceptable-score parameter. Essentially, it would check whether a word is in the vocabulary as-is, and if not, ask fuzzywuzzy to choose the best fuzzy match and accept it into the list of tokens if it scored at least a certain threshold.
If this isn't the best approach to deal with a fair amount of typos and slightly differently spelled but similar words, I'm open to suggestions.
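Roughly, the repair step would look something like this (just a sketch of the intent; repair_tokens is a hypothetical helper, and fuzzy_repair in the class below is still a stub):

from fuzzywuzzy import process

def repair_tokens(tokens, vocabulary, min_fuzzy_score=80):
    # Keep words already in the vocabulary; otherwise substitute the
    # best fuzzy match, but only if it clears the score threshold.
    repaired = []
    for token in tokens:
        if token in vocabulary:
            repaired.append(token)
            continue
        match = process.extractOne(token, vocabulary,
                                   score_cutoff=min_fuzzy_score)
        if match is not None:
            repaired.append(match[0])
    return repaired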
The problem is that the subclass keeps complaining that it has an empty vocabulary, which doesn't make any sense, since when I use a regular CountVectorizer in the same part of my code it works fine.
It raises many errors like this:
ValueError: empty vocabulary; perhaps the documents only contain stop words
What am I missing? I don't have it doing anything special yet; it ought to behave like the normal CountVectorizer:
import numpy
from typing import List

from sklearn.feature_extraction.text import CountVectorizer


class FuzzyCountVectorizer(CountVectorizer):
    def __init__(self, input='content', encoding='utf-8', decode_error='strict',
                 strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None,
                 token_pattern="(?u)\b\w\w+\b", ngram_range=(1, 1), analyzer='word',
                 max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False,
                 dtype=numpy.int64, min_fuzzy_score=80):
        super().__init__(
            input=input, encoding=encoding, decode_error=decode_error, strip_accents=strip_accents,
            lowercase=lowercase, preprocessor=preprocessor, tokenizer=tokenizer, stop_words=stop_words,
            token_pattern=token_pattern, ngram_range=ngram_range, analyzer=analyzer, max_df=max_df,
            min_df=min_df, max_features=max_features, vocabulary=vocabulary, binary=binary, dtype=dtype)
        # self._trained = False
        self.min_fuzzy_score = min_fuzzy_score

    @staticmethod
    def remove_non_alphanumeric_chars(s: str) -> str:
        pass

    @staticmethod
    def tokenize_text(s: str) -> List[str]:
        pass

    def fuzzy_repair(self, sl: List[str]) -> List[str]:
        pass

    def fit(self, raw_documents, y=None):
        print('Running FuzzyTokenizer Fit')
        # TODO clean up input
        super().fit(raw_documents=raw_documents, y=y)
        self._trained = True
        return self

    def transform(self, raw_documents):
        print('Running Transform')
        # TODO clean up input
        # TODO fuzzy_repair
        return super().transform(raw_documents=raw_documents)
The original function definition for scikit-learn's CountVectorizer has
token_pattern=r"(?u)\b\w\w+\b"
while in your subclass you don't use the raw string prefix r, hence this issue: without it, Python interprets \b as a backspace character rather than a regex word boundary, so the token pattern matches nothing and the vocabulary ends up empty. Also, instead of copying all the __init__ arguments, it might be easier to just use:
def __init__(self, *args, **kwargs):
    self.min_fuzzy_score = kwargs.pop('min_fuzzy_score', 80)
    super().__init__(*args, **kwargs)
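You can see the raw-string issue directly, independent of the vectorizer:

import re

# Without the r prefix, "\b" is the backspace character (\x08),
# so this pattern cannot match any ordinary word.
print(re.findall("(?u)\b\w\w+\b", "hello world"))   # []
print(re.findall(r"(?u)\b\w\w+\b", "hello world"))  # ['hello', 'world']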
As to whether this is the best approach, it depends on the size of your dataset. For a document set with a total of N_words words and a vocabulary of size N_vocab_size, this approach would require O(N_words * N_vocab_size) fuzzy word comparisons. Whereas if you vectorized your dataset with the standard CountVectorizer and then reduced the computed vocabulary (and bag-of-words matrix) by fuzzy matching, it would require "only" O(N_vocab_size**2) comparisons, along the lines of the sketch below.
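A minimal sketch of that post-hoc reduction, assuming fuzzywuzzy's fuzz.ratio as the similarity measure; the greedy merging strategy and the fuzzy_reduce_vocabulary name are illustrative choices, not an established API:

import numpy as np
from fuzzywuzzy import fuzz
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

def fuzzy_reduce_vocabulary(docs, min_fuzzy_score=80):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs).tocsc()
    terms = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
    # Greedily assign each term to the first already-seen representative
    # it fuzzily matches; unmatched terms become new representatives.
    canonical = {}
    representatives = []
    for term in terms:
        match = next((rep for rep in representatives
                      if fuzz.ratio(term, rep) >= min_fuzzy_score), None)
        if match is None:
            representatives.append(term)
            match = term
        canonical[term] = match
    # Merge the bag-of-words columns that map to the same representative.
    col_of = {rep: i for i, rep in enumerate(representatives)}
    reduced = np.zeros((X.shape[0], len(representatives)), dtype=X.dtype)
    for j, term in enumerate(terms):
        reduced[:, col_of[canonical[term]]] += X[:, j].toarray().ravel()
    return csr_matrix(reduced), representatives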
This would probably still not scale well for a vocabulary beyond a few tens of thousands of words. If you intend to apply some machine learning algorithm on the resulting sparse array, you might also want to try character n-grams, which are also somewhat robust with respect to typographical errors.
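Character n-grams are supported by CountVectorizer out of the box; the (3, 5) range below is just a common starting point, not a tuned value:

from sklearn.feature_extraction.text import CountVectorizer

# 'char_wb' builds n-grams from characters inside word boundaries,
# so "recieve" and "receive" still share most of their 3- to 5-grams.
char_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 5))
X = char_vectorizer.fit_transform(["recieve", "receive", "reciever"])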