I am new to statistics. I am trying to select the best features for classification on my data set, and I chose to do so by running SelectKBest from scikit-learn.
Here is my code:
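import pandas as pd
import sklearn.feature_selection as fs

# Keep the 10 highest-scoring features (the default score_func is f_classif)
kb = fs.SelectKBest(k=10)
kb.fit(X, y)

# get_support() is a boolean mask over the columns of X marking the selected features
mask = kb.get_support()
names = X.columns.values[mask]
scores = kb.scores_[mask]

# Tabulate the selected features, highest F-score first
ns_df = pd.DataFrame(data=list(zip(names, scores)), columns=['Feat_names', 'F_Scores'])
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending=[False, True])
print(ns_df_sorted)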
This gives output like this:
Feat_names F_Scores
4 go_out 29.870218
8 fun1_2 27.374212
6 fun1_1 26.470766
3 date 25.035227
7 shar1_1 17.629153
2 imprace 11.331197
0 order 11.290014
5 sinc1_1 8.309805
9 shar1_2 5.009775
1 field_cd 4.515538
I am not sure what the F score here signifies and what I can interpret from it.
You can read the F-scores as a measure of how well each feature, taken on its own, separates the classes in your dataset.
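As explained in the documentation, SelectKBest scores each feature with an F-test (by default f_classif, the ANOVA F-value). The F-score is the test statistic of that test: the ratio of the explained variance (between the class means) to the unexplained variance (within the classes). A higher score means the feature's mean differs more across classes relative to its spread inside each class.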
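To make the ratio concrete, here is a minimal sketch that computes the one-way ANOVA F statistic by hand for each feature and compares it with f_classif. The data is synthetic and the helper name anova_f is made up for this illustration; neither comes from the question.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 3 classes, 50 samples each.
y = np.repeat([0, 1, 2], 50)
informative = rng.normal(loc=y, scale=1.0)  # mean shifts with the class label
noise = rng.normal(size=y.size)             # unrelated to the class label
X = np.column_stack([informative, noise])


def anova_f(x, y):
    """One-way ANOVA F statistic for a single feature x and class labels y."""
    classes = np.unique(y)
    grand_mean = x.mean()
    # Explained variation: spread of the class means around the grand mean.
    ss_between = sum(
        x[y == c].size * (x[y == c].mean() - grand_mean) ** 2 for c in classes
    )
    # Unexplained variation: spread of the samples around their own class mean.
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    df_between = classes.size - 1
    df_within = x.size - classes.size
    return (ss_between / df_between) / (ss_within / df_within)


manual_scores = np.array([anova_f(X[:, j], y) for j in range(X.shape[1])])
sk_scores, p_values = f_classif(X, y)

print(manual_scores)
print(sk_scores)
```

The two printed arrays should agree, and the feature whose mean depends on the class should score far higher than the pure-noise feature.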
So, in your example, you could either keep all k=10 selected features or use the scores to refine the selection further (e.g. keeping only those whose F-score is above some threshold).
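A threshold-based selection could look roughly like the sketch below. It assumes synthetic data from make_classification, made-up column names, and an arbitrary cut-off of 10.0, so treat the numbers as placeholders to tune on your own data. The p-values that SelectKBest stores in pvalues_ are shown alongside the scores, since they are another possible criterion.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the real data (an assumption for illustration).
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# k="all" scores every feature without discarding any of them.
kb = SelectKBest(score_func=f_classif, k="all").fit(X, y)

results = pd.DataFrame(
    {"Feat_names": X.columns, "F_Scores": kb.scores_, "p_values": kb.pvalues_}
).sort_values("F_Scores", ascending=False)

# Hypothetical cut-off; there is no universally correct value.
f_threshold = 10.0
kept = results[results["F_Scores"] > f_threshold]

print(results)
print(kept["Feat_names"].tolist())
```

Note that the F-test looks at each feature separately, so a threshold like this cannot detect features that are only useful in combination with others.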