I can't seem to find an answer to my exact problem. Can anyone help?
A simplified description of my dataframe ("df"): It has 2 columns: one is a bunch of text ("Notes"), and the other is a binary variable indicating if the resolution time was above average or not ("y").
I did bag-of-words on the text:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(df["Notes"])
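My matrix is 6290 x 4650. I have no problem getting the words (i.e. the feature names):
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
feature_names = vectorizer.get_feature_names_out()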
feature_names
Next, I want to know which of these 4650 words are most associated with above-average resolution times, and I want to reduce the matrix so I can use it in a predictive model. I run a chi-square test to find the 20 most important words.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
selector = SelectKBest(chi2, k=20)
selector.fit(matrix, df["y"])  # the target is the binary "y" column of df
top_words = selector.get_support().nonzero()
# Pick only the most informative columns in the data.
chi_matrix = matrix[:,top_words[0]]
Now I'm stuck. How do I get the words from this reduced matrix ("chi_matrix")? What are my feature names? I was trying this:
chi_matrix.feature_names[selector.get_support(indices=True)].tolist()
Or
chi_matrix.feature_names[features.get_support()]
Both give me an error: feature_names not found. What am I missing?
A
After figuring out what I really wanted to do (thanks Daniel) and doing more research, I found a couple of other ways to meet my objective.
Way 1 - https://glowingpython.blogspot.com/2014/02/terms-selection-with-chi-square.html
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True,stop_words='english')
X = vectorizer.fit_transform(df["Notes"])
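from sklearn.feature_selection import chi2
# chi2() returns (scores, p-values); keep only the scores
chi2score = chi2(X, df['AboveAverage'])[0]
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
wscores = zip(vectorizer.get_feature_names_out(), chi2score)
# Sort ascending by score, so the highest-scoring words end up last
wchi2 = sorted(wscores, key=lambda x: x[1])
# Take the top 20 and transpose into a tuple of words and a tuple of scores
topchi2 = zip(*wchi2[-20:])
show = list(topchi2)
show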
Way 2 - This is the way I used because it was the easiest for me to understand and it produced a nice output listing the word, chi2 score, and p-value. It comes from another thread on here: Sklearn Chi2 For Feature Selection
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
vectorizer = CountVectorizer(lowercase=True,stop_words='english')
X = vectorizer.fit_transform(df["Notes"])
y = df['AboveAverage']
# Select 10 features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=10)
chi2_selector.fit(X, y)
# Look at scores returned from the selector for each feature
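# (get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead)
chi2_scores = pd.DataFrame(list(zip(vectorizer.get_feature_names_out(), chi2_selector.scores_, chi2_selector.pvalues_)),
                           columns=['ftr', 'score', 'pval'])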
chi2_scores