So I've been working on a natural language processing project in which I need to classify different styles of writing. Assuming that semantic features from texts have already been extracted for me, I plan to use Weka in Java to train SVM classifiers using these features that can be used to classify other different texts.
The part I'm having trouble on is that to train an SVM, the features must be converted into a feature vector. I'm not sure how you would be able to represent features such as vocabulary richness, n-grams, punctuation, number of paragraphs, and paragraph length as numbers in a vector. If somebody could point in the right direction, that would be greatly appreciated.
I'm not sure what values your attributes can take on, but perhaps this example will help you:
Suppose we are conducting a supervised learning experiment to try to determine if a period marks the end of a sentence or not, EOS
and NEOS
respectively. The training data came from normal sentences in a paragraph style format, but were transformed to the following vector model:
Of course, this is not a very complicated problem to solve, but it's a nice little introduction to Weka. We can't just use the words as features (really high dimensional space), but we can take their POS (part of speech) tags. We can also extract the length of words, whether or not the word was capitalized, etc.
So, you could feed anything as testing data, so long as you're able to transform it into the vector model above and extract the features used in the .arff.
The following (very small portion of) .arff file was used for determining whether a period in a sentence marked the end of or not:
@relation period
@attribute minus_three {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_three_length real
@attribute minus_three_case {'UC','LC','NA'}
@attribute minus_two {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_two_length real
@attribute minus_two_case {'UC','LC','NA'}
@attribute minus_one {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_one_length real
@attribute minus_one_case {'UC','LC','NA'}
@attribute plus_one {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_one_length real
@attribute plus_one_case {'UC','LC','NA'}
@attribute plus_two {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_two_length real
@attribute plus_two_case {'UC','LC','NA'}
@attribute plus_three {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_three_length real
@attribute plus_three_case {'UC','LC','NA'}
@attribute left_before_reliable real
@attribute right_before_reliable real
@attribute spaces_follow_period real
@attribute class {'EOS','NEOS'}
@data
VBP, 2, LC,NP, 4, UC,NN, 1, UC,NP, 6, UC,NEND, 1, NA,NN, 7, LC,31,47,1,NEOS
NNS, 10, LC,RBR, 4, LC,VBN, 5, LC,?, 3, NA,NP, 6, UC,NP, 6, UC,93,0,0,EOS
VBD, 4, LC,RB, 2, LC,RP, 4, LC,CC, 3, UC,UH, 5, LC,VBP, 2, LC,19,17,2,EOS
As you can see, each attribute can take on whatever you want it to:
real
denotes a real number LC
and UC
to denote upper case and lower case, respectivelyPOS
tagsYou need to figure out exactly what your features are, and what values you'll use to represent/classify them. Then, you need to transform your data into the format defined by your .arff.
To touch on your punctuation question, let's suppose that we have sentences that all end in .
or ?
. You can have an attribute called punc, which takes two values:
@attribute punc {'p','q'}
I didn't use ?
because that is what is (conventionally) assigned when a data point is missing. Our you could have boolean attributes that indicates whether a character or what have you was present (with 0, 1 or false, true). Another example, but for quality:
@attribute quality {'great','good', 'poor'}
How you determine said classification is up to you, but the above should get you started. Good luck.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With