I have two classes of sentences. Each has reasonably distinct pos-tag sequence. How can I train a Naive-Bayes classifier with POS-Tag sequence as a feature? Does Stanford CoreNLP/NLTK (Java or Python) provide any method for building a classifier with pos-tag as a feature? I know in python NaiveBayesClassifier
allows for building a NB classifier but it uses contains-a-word
as feature but can it be extended to use pos-tag-sequence as a feature ?
Step 1: Calculate the prior probability for given class labels. Step 2: Find Likelihood probability with each attribute for each class. Step 3: Put these value in Bayes Formula and calculate posterior probability. Step 4: See which class has a higher probability, given the input belongs to the higher probability class.
It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Use of HMM for POS Tagging The POS tagging process is the process of finding the sequence of tags which is most likely to have generated a given word sequence. We can model this POS process by using a Hidden Markov Model (HMM), where tags are the hidden states that produced the observable output, i.e., the words.
A CRF is a sequence modeling algorithm which is used to identify entities or patterns in text, such as POS tags. This model not only assumes that features are dependent on each other, but also considers future observations while learning a pattern.
If you know how to train and predict texts (or sentences in your case) using nltk's naive bayes classifier and words as features, than you can easily extend this approach in order to classify texts by pos-tags. This is because the classifier don't care about whether your feature-strings are words or tags. So you can simply replace the words of your sentences by pos-tags using for example nltk's standard pos tagger:
sent = ['So', 'they', 'have', 'internet', 'on', 'computers' , 'now']
tags = [t for w, t in nltk.pos_tag(sent)]
print tags
['IN', 'PRP', 'VBP', 'JJ', 'IN', 'NNS', 'RB']
As from now you can proceed with the "contains-a-word" approach.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With