I am using sklearn's TfidfVectorizer to create a document-feature matrix and a list of feature terms.
I do not want to keep the (n-1)- and (n-2)-grams of an n-gram that already exists in the feature set. For the example sentence "The quick brown fox jumps over the fence.",
I want to exclude the terms 'fox' and 'brown fox' if 'quick brown fox' exists.
My hypothesis is that repeating tokens artificially expands the feature set and distorts the results of downstream tasks such as clustering.
I know this is not an efficient way to do it, but here is what I did. I use a pandas Series at the end just to subset the array with the selected indices.
import pandas as pd

def removeSubgrams(features):
    # Sort features by n-gram length so shorter grams come first
    features = sorted(features, key=lambda x: len(x.split(" ")))
    to_remove = []
    # Iterate over all features
    for i, subfeature in enumerate(features):
        for longerfeature in features[i + 1:]:
            # Pad with spaces so 'fox' does not match inside e.g. 'foxglove'
            if f" {subfeature} " in f" {longerfeature} ":
                to_remove.append(i)
                # break at the first longer feature containing subfeature
                break
    features = pd.Series(features)
    # keep only those features that are not marked for removal
    features = features.loc[~features.index.isin(to_remove)]
    return features