 

How to accurately classify text with a lot of potential values using scikit?

I have a set of blacklisted terms I want to identify within a corpus of text paragraphs. Each term is around 1 to 5 words long and contains certain keywords I do not want in my corpus of documents. If a term, or something similar to it, is identified in the corpus, I want it removed.

Removal aside, I am struggling to accurately identify these terms in my corpus. I am using scikit-learn and have tried two separate approaches:

  1. A MultinomialNB classification approach using tf-idf vector features, with a mix of blacklisted terms and clean terms used as training data (a minimal sketch of this approach is shown after this list).

  2. A OneClassSVM approach where only the blacklisted keywords are used as training data, and any text passed in that does not resemble the blacklisted terms is treated as an outlier.
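
A minimal sketch of approach 1 (the training file name and its Keyword/Label columns here are only illustrative):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# illustrative training file: one term per row, Label 1 = blacklisted, 0 = clean
train_df = pd.read_csv("keyword_training_labelled.csv")

nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 5))),  # terms to char n-gram tf-idf vectors
    ('clf', MultinomialNB()),  # binary blacklisted-vs-clean classifier
])

nb_pipeline.fit(train_df['Keyword'], train_df['Label'])
print(nb_pipeline.predict(["some candidate term"]))  # 1 would mean "blacklisted"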

Here is the code for my OneClassSVM approach:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import OneClassSVM
from sklearn.model_selection import KFold

df = pd.read_csv("keyword_training_blacklist.csv")

keywords_list = df['Keyword']

pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', max_df=0.75, min_df=1, ngram_range=(1, 5))),
    # strings to token integer counts
    ('tfidf', TfidfTransformer(use_idf=False, norm='l2')),  # integer counts to weighted TF-IDF scores
    ('clf', OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)),  # one-class SVM trained on the TF-IDF vectors
])

kf = KFold(n_splits=8)
for train_index, test_index in kf.split(keywords_list):
    # make training and testing datasets
    X_train, X_test = keywords_list.iloc[train_index], keywords_list.iloc[test_index]

    pipeline.fit(X_train)  # train on blacklisted terms only (no labels needed)
    predicted = pipeline.predict(X_test)
    print(predicted[predicted == 1].size / predicted.size)  # fraction of held-out terms accepted as inliers

csv_df = pd.read_csv("corpus.csv")

testCorpus = csv_df['Terms']

testCorpus = testCorpus.drop_duplicates()


for s in testCorpus:
    if pipeline.predict([s])[0] == 1:  # 1 means the term is accepted as an inlier (resembles the blacklist)
        print(s)

In practice, I get many false positives when I pass my corpus to the algorithm. My blacklisted-term training data stands at around 3,000 terms. Does my training data need to be larger, or am I missing something obvious?

asked by GreenGodot on Oct 31 '22

1 Answer

Try using difflib to identify the closest matches in the corpus to each of your blacklisted terms.

import difflib
from nltk.util import ngrams

words = corpus.split(' ')  # split the corpus into words on spaces (can be improved)

words_ngrams = []  # word n-grams of length 1 to 5, joined back into strings
for n in range(1, 6):
    words_ngrams.extend(' '.join(gram) for gram in ngrams(words, n))


to_delete = []   # will contain tuples (index, length) of matched terms to delete from the corpus
sim_rate = 0.8   # similarity cutoff
max_matches = 4  # maximum number of matches for each term
for term in terms:  # terms = your list of blacklisted terms
    matches = difflib.get_close_matches(term, words_ngrams, n=max_matches, cutoff=sim_rate)
    for match in matches:
        to_delete.append((corpus.index(match), len(match)))
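
To then apply those deletions, one sketch (assuming the matched spans do not overlap) is to remove them from the end of the string so earlier offsets stay valid:

# delete matched spans from the end of the corpus so earlier indices remain valid
for start, length in sorted(to_delete, reverse=True):
    corpus = corpus[:start] + corpus[start + length:]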

You can also use difflib.SequenceMatcher if you want a similarity score between terms and n-grams.
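
A minimal example of that, with a hypothetical blacklisted term and corpus n-gram:

import difflib

term = "forbidden phrase"      # hypothetical blacklisted term
candidate = "forbiden phrase"  # hypothetical n-gram from the corpus

# ratio() returns a similarity score between 0.0 and 1.0
score = difflib.SequenceMatcher(None, term, candidate).ratio()
print(score)  # roughly 0.97 for this pair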

answered by Ghilas BELHADJ on Nov 15 '22