I am using sklearn's TfidfVectorizer to create a document-feature matrix and a list of feature terms.
I do not want to keep the (n-1)- and (n-2)-grams of an n-gram that already exists in the feature set. For the example sentence "The quick brown fox jumps over the fence.",
I want to exclude the terms 'fox' and 'brown fox' if 'quick brown fox' exists.
My hypothesis is that repeating tokens artificially expands the feature set and distorts the results of downstream tasks such as clustering.
I know this is not an efficient way to do it, but here is what I did. I use a pandas Series at the end just to subset the array with the selected indices.
import pandas as pd

def removeSubgrams(features):
    # Sort features by n-gram length so shorter grams come first
    features = sorted(features, key=lambda x: len(x.split(" ")))
    to_remove = []
    # Iterate over all features
    for i, subfeature in enumerate(features):
        for longerfeature in features[i + 1:]:
            # Pad with spaces so 'fox' does not match inside e.g. 'foxglove'
            if f" {subfeature} " in f" {longerfeature} ":
                to_remove.append(i)
                # break at the first longer feature containing subfeature
                break
    features = pd.Series(features)
    # keep only those features that are not marked for removal
    features = features.loc[~features.index.isin(to_remove)]
    return features