I have a document with tagged data in the format: Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]. I want to train a model on a set of documents tagged like this, and then use my model to tag new documents. Is this possible in NLTK? I have looked at chunking and the NLTK-Trainer scripts, but these have a restricted set of tags and corpora, while my dataset has custom tags.
As @AleksandarSavkov wrote already, this is essentially a named entity recognition (NER) task, or more generally a chunking task, as you already realize. How to do it is covered nicely in chapter 7 of the NLTK book. I recommend you ignore the sections on regexp tagging and use the approach in section 3, "Developing and evaluating chunkers". It includes code samples you can use verbatim to create a chunker (the ConsecutiveNPChunkTagger). Your responsibility is to select features that will give you good performance.
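If you don't have the book handy, here is a minimal sketch of that style of chunker, adapted from the section 7.3 example. I've swapped the book's Maxent classifier for Naive Bayes so it runs without extra dependencies, and npchunk_features is a deliberately bare placeholder; it assumes training sentences are lists of ((word, pos), iob-tag) pairs:

import nltk

def npchunk_features(sentence, i, history):
    # Placeholder feature set: current word and POS tag only.
    # This is the part where you should invest your effort.
    word, pos = sentence[i]
    return {"word": word, "pos": pos}

class ConsecutiveChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        # train_sents: list of sentences, each a list of ((word, pos), iob) pairs
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)  # -> [(word, pos), ...]
            history = []
            for i, (word_pos, iob) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, iob))
                history.append(iob)
        # The book uses a Maxent classifier; Naive Bayes needs no extra setup.
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        # sentence: list of (word, pos) pairs
        history = []
        for i, word_pos in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            history.append(self.classifier.classify(featureset))
        return list(zip(sentence, history))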
You'll need to transform your data into the IOB format expected by NLTK's chunking architecture. Since it also expects part-of-speech tags, the first step should be to run your input through a POS tagger; nltk.pos_tag() will do a decent enough job (once you strip off markup like [KEYWORD ...]), and it requires no additional software to be installed. When your corpus is in the following format (word, POS tag, IOB tag), you are ready to train a recognizer:
Hi NNP O
here RB O
's POS O
my PRP$ O
phone NN B-KEYWORD
number NN I-KEYWORD
, , O
let VB O
me PRP O
...
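For completeness, here is one way to do the conversion from your bracket markup to those triples; the regex and the helper name are mine, and it assumes tags never nest:

import re
import nltk

TAG_RE = re.compile(r'\[(\w+) ([^\]]+)\]')

def to_iob_triples(text):
    # Turn "... [KEYWORD phone number] ..." into (word, POS, IOB) triples.
    tokens, iob_tags = [], []
    cursor = 0
    for m in TAG_RE.finditer(text):
        # Untagged stretch before the entity: everything is 'O'.
        for w in nltk.word_tokenize(text[cursor:m.start()]):
            tokens.append(w)
            iob_tags.append('O')
        # Entity tokens: B-TAG for the first word, I-TAG for the rest.
        for j, w in enumerate(nltk.word_tokenize(m.group(2))):
            tokens.append(w)
            iob_tags.append(('B-' if j == 0 else 'I-') + m.group(1))
        cursor = m.end()
    for w in nltk.word_tokenize(text[cursor:]):
        tokens.append(w)
        iob_tags.append('O')
    # POS-tag the markup-free token sequence, then zip everything together.
    return [(w, pos, tag) for (w, pos), tag in zip(nltk.pos_tag(tokens), iob_tags)]

Running it on your example sentence yields triples like ('phone', 'NN', 'B-KEYWORD') and ('number', 'NN', 'I-KEYWORD'), matching the sample above.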
The problem you are looking to solve is most commonly called Named Entity Recognition (NER). There are many algorithms that can help you solve it, but the most important notion is that you need to convert your text data into a suitable format for sequence taggers. Here is an example of the BIO format:
I O
love O
Paris B-LOC
and O
New B-LOC
York I-LOC
. O
From there, you can choose to train any type of classifier, such as Naive Bayes, SVM, MaxEnt, or CRF. Currently the most popular algorithm for such multi-token sequence classification tasks is the CRF. There are off-the-shelf tools that will let you train a BIO model from a file in the format shown above (although they were originally intended for chunking), e.g. YamCha, CRF++, CRFSuite, and Wapiti. If you are using Python, you can look into scikit-learn, python-crfsuite, and PyStruct in addition to NLTK; a minimal python-crfsuite sketch follows.
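Here is that sketch; the feature function and model file name are mine, and the one-sentence training set (the example above, POS-tagged by hand) is only there to make it self-contained:

import pycrfsuite  # pip install python-crfsuite

def features(sent, i):
    # sent: list of (word, pos, bio) triples; a deliberately tiny feature set.
    word, pos, _ = sent[i]
    feats = {'word.lower': word.lower(), 'pos': pos, 'word.isdigit': word.isdigit()}
    feats['-1:pos'] = sent[i - 1][1] if i > 0 else 'BOS'
    return feats

train_sents = [[('I', 'PRP', 'O'), ('love', 'VBP', 'O'), ('Paris', 'NNP', 'B-LOC'),
                ('and', 'CC', 'O'), ('New', 'NNP', 'B-LOC'),
                ('York', 'NNP', 'I-LOC'), ('.', '.', 'O')]]

trainer = pycrfsuite.Trainer(verbose=False)
for sent in train_sents:
    xseq = [features(sent, i) for i in range(len(sent))]
    yseq = [bio for _, _, bio in sent]
    trainer.append(xseq, yseq)
trainer.train('ner.crfsuite')

# Tag a (feature-extracted) sentence with the trained model.
tagger = pycrfsuite.Tagger()
tagger.open('ner.crfsuite')
print(tagger.tag([features(train_sents[0], i) for i in range(len(train_sents[0]))]))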