Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Natural Language Processing - Converting Text Features Into Feature Vectors

So I've been working on a natural language processing project in which I need to classify different styles of writing. Assuming that semantic features from texts have already been extracted for me, I plan to use Weka in Java to train SVM classifiers using these features that can be used to classify other different texts.

The part I'm having trouble on is that to train an SVM, the features must be converted into a feature vector. I'm not sure how you would be able to represent features such as vocabulary richness, n-grams, punctuation, number of paragraphs, and paragraph length as numbers in a vector. If somebody could point in the right direction, that would be greatly appreciated.

like image 498
myrocks2 Avatar asked May 29 '13 20:05

myrocks2


1 Answers

I'm not sure what values your attributes can take on, but perhaps this example will help you:

Suppose we are conducting a supervised learning experiment to try to determine if a period marks the end of a sentence or not, EOS and NEOS respectively. The training data came from normal sentences in a paragraph style format, but were transformed to the following vector model:

  • Column 1: Class: End-of-Sentence or Not-End-of-Sentence
  • Columns 2-8: The +/- 3 words surrounding the period in question
  • Columns 9,10: The number of words to the left/right, respectively, of the period before the next reliable sentence delimiter (e.g. ?, ! or a paragraph marker).
  • Column 11: The number of spaces following the period.

Of course, this is not a very complicated problem to solve, but it's a nice little introduction to Weka. We can't just use the words as features (really high dimensional space), but we can take their POS (part of speech) tags. We can also extract the length of words, whether or not the word was capitalized, etc.

So, you could feed anything as testing data, so long as you're able to transform it into the vector model above and extract the features used in the .arff.

The following (very small portion of) .arff file was used for determining whether a period in a sentence marked the end of or not:

@relation period

@attribute minus_three {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_three_length real
@attribute minus_three_case {'UC','LC','NA'}
@attribute minus_two {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_two_length real
@attribute minus_two_case {'UC','LC','NA'}
@attribute minus_one {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_one_length real
@attribute minus_one_case {'UC','LC','NA'}
@attribute plus_one {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_one_length real
@attribute plus_one_case {'UC','LC','NA'}
@attribute plus_two {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_two_length real
@attribute plus_two_case {'UC','LC','NA'}
@attribute plus_three {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_three_length real
@attribute plus_three_case {'UC','LC','NA'}
@attribute left_before_reliable real
@attribute right_before_reliable real
@attribute spaces_follow_period real
@attribute class  {'EOS','NEOS'}

@data

VBP, 2, LC,NP, 4, UC,NN, 1, UC,NP, 6, UC,NEND, 1, NA,NN, 7, LC,31,47,1,NEOS
NNS, 10, LC,RBR, 4, LC,VBN, 5, LC,?, 3, NA,NP, 6, UC,NP, 6, UC,93,0,0,EOS
VBD, 4, LC,RB, 2, LC,RP, 4, LC,CC, 3, UC,UH, 5, LC,VBP, 2, LC,19,17,2,EOS

As you can see, each attribute can take on whatever you want it to:

  • real denotes a real number
  • I made up LC and UC to denote upper case and lower case, respectively
  • Most of the other values are POS tags

You need to figure out exactly what your features are, and what values you'll use to represent/classify them. Then, you need to transform your data into the format defined by your .arff.

To touch on your punctuation question, let's suppose that we have sentences that all end in . or ?. You can have an attribute called punc, which takes two values:

@attribute punc {'p','q'}

I didn't use ? because that is what is (conventionally) assigned when a data point is missing. Our you could have boolean attributes that indicates whether a character or what have you was present (with 0, 1 or false, true). Another example, but for quality:

@attribute quality {'great','good', 'poor'}

How you determine said classification is up to you, but the above should get you started. Good luck.

like image 152
Steve P. Avatar answered Oct 21 '22 16:10

Steve P.