I have POS-tagged sentences obtained using the Stanford POS tagger. E.g.:
The/DT island/NN was/VBD very/RB beautiful/JJ ./. I/PRP love/VBP it/PRP ./.
(XML format is also available.)
Can anyone explain how to perform feature selection on these POS-tagged sentences and convert them into feature vectors for text classification using a machine learning method?
A simple way to start out would be something like the following (assuming word order is not important for your classification algorithm).
First, you would manually classify a number of sentences. This is your training dataset. Generally, the more sentences you manually classify from each class, the greater the accuracy you will achieve. For a supervised approach like this, keep in mind that the only features available for selection are those that occur in your manually classified sentences. Each unique word/POS combination across all your training sentences is one feature.
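As a rough illustration of that step, here is a minimal sketch in plain Python that turns tagged sentences into binary feature vectors, one position per unique word/POS combination. The second training sentence, the "pos"/"neg" labels, the lowercasing of words, and the helper names `tokens` and `vectorize` are all assumptions made for the example, not part of the question.

```python
# Hypothetical training data: (tagged sentence, manually assigned label).
training = [
    ("The/DT island/NN was/VBD very/RB beautiful/JJ ./. I/PRP love/VBP it/PRP ./.", "pos"),
    ("The/DT hotel/NN was/VBD very/RB dirty/JJ ./.", "neg"),
]

def tokens(tagged):
    """Split a word/TAG string into word/POS tokens, lowercasing the word part."""
    result = []
    for item in tagged.split():
        # Split on the last slash so a word containing "/" keeps its tag intact.
        word, _, tag = item.rpartition("/")
        result.append(f"{word.lower()}/{tag}")
    return result

# One feature per unique word/POS combination seen in the training sentences.
vocabulary = sorted({tok for text, _ in training for tok in tokens(text)})
index = {tok: i for i, tok in enumerate(vocabulary)}

def vectorize(tagged):
    """Binary bag-of-features vector; tokens unseen in training are ignored."""
    vec = [0] * len(vocabulary)
    for tok in tokens(tagged):
        if tok in index:
            vec[index[tok]] = 1
    return vec

X = [vectorize(text) for text, _ in training]  # feature vectors
y = [label for _, label in training]           # class labels
```

Binary presence is used here because word order is assumed not to matter; counts or TF-IDF weights would slot into `vectorize` the same way.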
Finally, you must choose a feature selection algorithm. There are many out there, but a popular one is chi-squared. Some others are Information Gain, Mutual Information, etc. Using chi-squared, you would measure the dependence of the class variable on each feature individually. A higher chi-squared value means a stronger dependence between the feature and the class, so you would pick some threshold, such as the top 10% of features with the highest chi-squared value, and only keep those features to later use in your classifier.
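To make the scoring concrete, here is a small sketch that computes the Pearson chi-squared statistic for each binary feature against the class labels and keeps the highest-scoring fraction. The toy matrix, the feature names, and the function names `chi_squared` and `select_top` are hypothetical; a library implementation would normally be preferable on real data.

```python
from collections import Counter

def chi_squared(feature_column, labels):
    """Pearson chi-squared statistic for one feature against the class labels."""
    n = len(labels)
    observed = Counter(zip(feature_column, labels))
    feature_totals = Counter(feature_column)
    class_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in class_totals.items():
            # Count expected in this cell if feature and class were independent.
            expected = f_count * c_count / n
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat

def select_top(X, y, names, fraction=0.10):
    """Keep the given fraction of features with the highest chi-squared scores."""
    scores = {name: chi_squared([row[j] for row in X], y)
              for j, name in enumerate(names)}
    keep = max(1, int(len(names) * fraction))
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranked[:keep], scores

# Hypothetical toy data: rows are sentences, columns are word/POS features.
names = ["beautiful/JJ", "dirty/JJ", "was/VBD"]
X = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
y = ["pos", "pos", "neg", "neg"]

selected, scores = select_top(X, y, names, fraction=0.7)
```

In this toy data "was/VBD" appears in every sentence, so it carries no information about the class and scores zero, while the two adjectives separate the classes perfectly.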
The choice of feature selection algorithm is important, and it should suit the classification algorithm you are using. For example, chi-squared is good when you want to find features that correlate either positively or negatively with your class. In other circumstances, you might only want positively correlated features, so you would need to pick another algorithm or modify an existing one.
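One possible modification, shown purely as a sketch, is a one-sided chi-squared for a binary feature and a single target class: compute the usual 2x2 statistic, but return zero when the feature is not positively associated with the target. The function name `one_sided_chi_squared` and the toy data are assumptions; this is one way to adapt the measure, not a standard named algorithm.

```python
def one_sided_chi_squared(feature_column, labels, target):
    """2x2 chi-squared score, kept only when the feature favours the target class."""
    pairs = list(zip(feature_column, labels))
    a = sum(1 for f, c in pairs if f == 1 and c == target)  # present, target
    b = sum(1 for f, c in pairs if f == 1 and c != target)  # present, other
    c_ = sum(1 for f, c in pairs if f == 0 and c == target)  # absent, target
    d = sum(1 for f, c in pairs if f == 0 and c != target)  # absent, other
    n = a + b + c_ + d
    denominator = (a + b) * (c_ + d) * (a + c_) * (b + d)
    direction = a * d - b * c_  # positive when the feature co-occurs with target
    if denominator == 0 or direction <= 0:
        return 0.0
    return n * direction ** 2 / denominator

# Hypothetical toy data: the feature occurs only in "pos" sentences.
feature = [1, 1, 0, 0]
labels = ["pos", "pos", "neg", "neg"]

score_for_pos = one_sided_chi_squared(feature, labels, "pos")
score_for_neg = one_sided_chi_squared(feature, labels, "neg")
```

With this variant the same feature scores highly for the class it supports and zero for the class it argues against, so ranking by score keeps only positively correlated features.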
Hope that helps, William Riley-Land