I have a variety of blacklisted terms I want identified within a corpus of text paragraphs. Each term is around 1-5 words long and contains certain keywords I do not want in my corpus of documents. If a term, or something similar to it, is identified in the corpus, I want it removed from my corpus.
Removal aside, I am struggling to accurately identify these terms in my corpus. I am using scikit-learn and have tried two separate approaches:
A MultinomialNB classification approach using tf-idf vector features with a mix of blacklisted terms and clean terms used as training data.
A OneClassSVM approach where only the blacklisted terms are used as training data, and any text passed in that does not resemble the blacklisted terms is considered an outlier.
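For context, the first approach looks roughly like this; the toy data and column names below are placeholders, not my actual training set:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the real training data: a mix of blacklisted and clean terms.
train = pd.DataFrame({
    'Term':  ["forbidden secret phrase", "banned keyword here",
              "perfectly normal text", "another clean sentence"],
    'Label': [1, 1, 0, 0],   # 1 = blacklisted, 0 = clean
})

clf = Pipeline([
    ('tfidf', TfidfVectorizer()),  # strings -> tf-idf feature vectors
    ('nb', MultinomialNB()),       # binary classifier over the two term classes
])
clf.fit(train['Term'], train['Label'])
print(clf.predict(["forbidden secret phrase"]))
```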
Here is the code for my OneClassSVM approach:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import OneClassSVM
from sklearn.model_selection import KFold

df = pd.read_csv("keyword_training_blacklist.csv")
keywords_list = df['Keyword']

pipeline = Pipeline([
    # strings to token integer counts (character n-grams of length 1-5)
    ('vect', CountVectorizer(analyzer='char_wb', max_df=0.75, min_df=1, ngram_range=(1, 5))),
    # integer counts to l2-normalised term-frequency vectors (idf disabled)
    ('tfidf', TfidfTransformer(use_idf=False, norm='l2')),
    # one-class SVM trained on the blacklisted terms only
    ('clf', OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)),
])

kf = KFold(n_splits=8)
for train_index, test_index in kf.split(keywords_list):
    # make training and testing datasets
    X_train, X_test = keywords_list.iloc[train_index], keywords_list.iloc[test_index]
    pipeline.fit(X_train)  # fit the one-class model on blacklisted terms only
    predicted = pipeline.predict(X_test)
    # fraction of held-out blacklisted terms recognised as inliers
    print(predicted[predicted == 1].size / predicted.size)

csv_df = pd.read_csv("corpus.csv")
testCorpus = csv_df['Terms']
testCorpus = testCorpus.drop_duplicates()
for s in testCorpus:
    if pipeline.predict([s])[0] == 1:
        print(s)
In practice, I am getting many false positives when I pass my corpus to the algorithm. My blacklisted-term training data stands at around 3000 terms. Does the size of my training data need to be increased further, or am I missing something obvious?
Try using difflib to identify the closest match in the corpus to each of your blacklisted terms.
import difflib
from nltk.util import ngrams

words = corpus.split(' ')  # split corpus into words on spaces (can be improved)

words_ngrams = []  # n-grams of 1 to 5 words
for n in range(1, 6):
    # ngrams() yields tuples of words, so join each tuple back into a string
    words_ngrams.extend(' '.join(gram) for gram in ngrams(words, n))

to_delete = []  # will contain tuples (index, length) of matched terms to delete from corpus
sim_rate = 0.8  # similarity rate
max_matches = 4  # maximum number of matches for each term

for term in terms:  # terms: your list of blacklisted terms
    matches = difflib.get_close_matches(term, words_ngrams, n=max_matches, cutoff=sim_rate)
    for match in matches:
        # note: str.index only finds the first occurrence of each match
        to_delete.append((corpus.index(match), len(match)))
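To actually perform the removal, you can apply the `(index, length)` tuples back to front, so that deleting a span does not shift the offsets of spans before it. A minimal sketch, assuming `corpus` is a plain string and `to_delete` is filled as above (the `remove_spans` helper is my own, not part of difflib):

```python
def remove_spans(corpus, to_delete):
    # Delete spans from the end backwards so earlier indices stay valid;
    # set() drops duplicate spans produced by repeated matches.
    for start, length in sorted(set(to_delete), reverse=True):
        corpus = corpus[:start] + corpus[start + length:]
    return corpus

corpus = "the quick brown fox jumps over the lazy dog"
to_delete = [(4, 11), (35, 4)]  # spans covering "quick brown" and "lazy"
cleaned = remove_spans(corpus, to_delete)
print(' '.join(cleaned.split()))  # collapse the leftover double spaces
```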
You can also use difflib.SequenceMatcher if you want a similarity score between your terms and the n-grams.