
Creating own POS Tagger

I have found the Stanford POS Tagger pretty good, but somehow I found myself in need of creating my own POS tagger.

For the last two weeks I have been rambling here and there: whether to start from parse trees, or whether a POS tagger has to come first so that I can then build parse trees, and whether ugly CFGs and NFAs can help me create a POS tagger, and so on.

I am ending the rambling here and asking the seniors: where should I begin with POS tagging? (The language of choice is Python, but C and Java won't hurt.)

akshayb asked Apr 10 '13 09:04

People also ask

How do you make a POS tagger?

You will need a lot of samples already labeled with POS tags. Then you can use the samples to train an RNN: the x input to the RNN is the sequence of tokens (words), and the y output is the sequence of POS tags. Once trained, the RNN can be used as a POS tagger.
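For illustration, here is a minimal sketch of that idea using PyTorch; the vocabulary size, tag-set size, and the random tensors standing in for a real labeled corpus are all placeholder assumptions, not a complete recipe:

    import torch
    import torch.nn as nn

    class RNNTagger(nn.Module):
        """Embed tokens, run an LSTM, emit one tag score vector per token."""
        def __init__(self, vocab_size, tagset_size, emb_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, tagset_size)

        def forward(self, token_ids):              # (batch, seq_len)
            hidden, _ = self.lstm(self.embed(token_ids))
            return self.out(hidden)                # (batch, seq_len, tagset_size)

    model = RNNTagger(vocab_size=10000, tagset_size=45)   # 45 ≈ Penn Treebank tag set
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())

    x = torch.randint(0, 10000, (8, 12))   # stand-in token ids (batch of 8, length 12)
    y = torch.randint(0, 45, (8, 12))      # stand-in gold tags
    loss = loss_fn(model(x).reshape(-1, 45), y.reshape(-1))
    loss.backward()
    optimizer.step()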

What is NLTK POS tagger?

POS tagging in NLTK is the process of marking up the words in a text with a particular part of speech, based on each word's definition and context. Some of the POS tags NLTK uses are CC, CD, EX, JJ, MD, NNP, PDT, PRP$, TO, etc. A POS tagger is used to assign grammatical information to each word of a sentence.
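For example, the built-in tagger is a one-liner; this assumes the tokenizer and tagger data packages (e.g. 'punkt' and 'averaged_perceptron_tagger') have been downloaded via nltk.download():

    import nltk

    tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ...]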

What is JJ in POS tagging?

JJ is the tag for an adjective (e.g. 'big'). Related tags include:

  • IN – preposition/subordinating conjunction
  • JJ – adjective ('big')
  • JJR – adjective, comparative ('bigger')
  • JJS – adjective, superlative ('biggest')

What is POS tagging in NLP?

Part-of-speech (POS) tagging is a popular Natural Language Processing task in which the words in a text (corpus) are categorized according to a particular part of speech, depending on each word's definition and its context.


1 Answer

It depends on what your ultimate goal is.

If the goal is to perform syntax analysis, i.e. to determine the subject, the predicate, its arguments, its modifiers etc., and then possibly even to perform a semantic analysis, then you should not worry about the POS tagger. Instead you should first look at the various methods for syntactic analysis – in particular, phrase-structure-based methods, probabilistic methods, and finite-state methods – and determine the tool you want to use for that. The decision will depend on what your speed and accuracy requirements are, how much time you will have for long-term improvement and maintenance, and other factors. Once you've decided on the right tool (or implementation strategy), you may end up not needing a tagger any more.

The reason is that many syntax analysis strategies fundamentally don't need a tagger: they only perform a dictionary lookup, which returns, for each word, one or more possible POS tags; the disambiguation (i.e. deciding which of these tags is actually correct) is performed implicitly by the syntax analyser. Some syntax analysers may expect you to apply a POS tagger after the dictionary lookup, but they will also tell you which one to use, so the answer to your question will then follow quite naturally.
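To make the dictionary-lookup idea concrete, here is a toy sketch with a hypothetical mini-lexicon; the lookup returns every candidate tag per word, and a syntax analyser would then pick the right one:

    # Hypothetical mini-lexicon: each word maps to its possible POS tags.
    LEXICON = {
        "time":  {"NN", "VB"},
        "flies": {"NNS", "VBZ"},
        "like":  {"IN", "VB"},
        "an":    {"DT"},
        "arrow": {"NN"},
    }

    def lookup(tokens):
        # Return all candidate tags; disambiguation is left to the parser.
        return [(tok, sorted(LEXICON.get(tok, {"UNK"}))) for tok in tokens]

    print(lookup("time flies like an arrow".split()))
    # [('time', ['NN', 'VB']), ('flies', ['NNS', 'VBZ']), ('like', ['IN', 'VB']), ...]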

If, on the other hand, your goal does not require a full-fledged syntax analysis, but only part-of-speech tagging, I'd suggest that you first look at the existing alternatives before deciding to build your own. Possible choices include, but are not limited to:

  • Stanford Tagger
  • Mallet Simple Tagger
  • ANNIE tagger (This is a Brill-style tagger embedded into a larger NLP framework)
  • TreeTagger
  • Brill tagger
  • SENNA tagger

Which one is right for your needs depends on a number of factors, not necessarily in this prioritized order:

  1. Opaqueness: Do you intend to make corrections to improve the results, possibly by maintaining exception lists and post-correction rules, possibly over a long period of time? In this case, you may need a tagger that is not only open-source, but uses a methodology that allows manual modification of its disambiguation strategy. This is easier in a rule-based or TBL tagger (such as the Brill tagger, whose learned rules are human-readable – see the sketch after this list), and to some extent in taggers based on decision-tree learning (such as the TreeTagger); it is more difficult, and the possibilities are more limited, in taggers based on Hidden Markov Models (HMM) or conditional random fields (CRF) (such as the Mallet Simple Tagger), and very difficult (except for pure post-correction exception lists) in taggers based on neural networks (such as SENNA).

  2. Target language: Do you need it just for English, or other languages as well? The TreeTagger has out-of-the-box support for many European languages, but the others in the list above don't. Adding support for a language will always require a dictionary, it will usually require an annotated training corpus (which may be expensive), and it will sometimes require that you write or modify a few hundred initial rules (e.g. if a Brill-tagger approach is used).

  3. Framework and programming language: Mallet and Stanford are for Java, the TreeTagger is in C (but not open-source; there are Python and Java wrappers, but they may cause significant slow-down and have other issues(‡)), SENNA is in C and open-source, ANNIE is in Java and made for the GATE framework, and so on. There are differences in the environment these taggers require, and moving them out of their natural habitat can be painful. NLTK (Python) has wrappers for some of them, but they typically don't involve an actual embedding of the source into Python; instead they simply perform a system call for each piece of text you want to tag. This has severe performance implications.

  4. Speed: If you only process a few sentences per second, any tagger will be able to handle that. But if you are dealing with terabytes of data or need to cope with extreme peaks in usage, you need to perform the right kind of stress tests as part of your evaluation and decision making. I know from personal experience that the TreeTagger and SENNA are very fast, Stanford is quite a bit slower, and NLTK wrappers are often several orders of magnitude slower. In any case, you need to test. Note that POS tagging can be parallelized in a straightforward way by dividing the input into partitions and running several tagging processes in parallel. Memory footprint is usually not an issue for the tagger itself (but it can be if the tagger is part of a general NLP framework that is loaded into memory completely).
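Regarding the transparency point in item 1: with NLTK's Brill-tagger implementation, the learned transformation rules can be printed and inspected. This is only a sketch – it assumes the 'treebank' corpus has been downloaded via nltk.download(), and the corpus slice and rule count are arbitrary:

    from nltk.corpus import treebank
    from nltk.tag import UnigramTagger, brill, brill_trainer

    train = treebank.tagged_sents()[:3000]
    baseline = UnigramTagger(train)                     # initial tagger to be corrected
    trainer = brill_trainer.BrillTaggerTrainer(baseline, brill.fntbl37(), trace=0)
    tagger = trainer.train(train, max_rules=20)

    for rule in tagger.rules()[:5]:                     # human-readable, editable rules
        print(rule)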

Finally, if none of the existing taggers meets your needs and you really decide to create your own tagger, you'll still need to make a decision similar to the above: the right approach depends on accuracy, speed, maintenance and multilinguality factors. The main approaches to POS tagging are quite well represented by the list of examples above, i.e. rule/TBL-style (Brill), HMM/CRF (Mallet), maximum-entropy (Stanford), decision-tree learning (TreeTagger), and neural networks (SENNA). Even if you decide to make your own, it's a good idea to study some of the existing ones to understand how they operate and where the problems are.
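For instance, a supervised HMM baseline takes only a few lines with NLTK and can serve as a reference point before you write anything from scratch (again assuming the 'treebank' corpus is available; the train/test split is arbitrary):

    from nltk.corpus import treebank
    from nltk.tag import hmm

    sents = treebank.tagged_sents()
    train, test = sents[:3000], sents[3000:3100]
    tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)
    print(tagger.accuracy(test))   # called .evaluate() in older NLTK versions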

As a final remark on multilinguality: classic POS taggers such as the above require that you tokenize the input before you apply the tagger (or they implicitly perform a simple tokenization). This won't work with languages that cannot be tokenized using punctuation and white space as token boundaries, e.g. Chinese, Japanese, Thai, to some extent Korean, and a few other languages. For those, you'll need to use a specialised tokenizer, and such tokenizers usually perform both tokenization and POS tagging in one step.
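For example, for Chinese the widely used jieba package does exactly that: its posseg module segments and tags in a single pass (pip install jieba):

    import jieba.posseg as pseg

    # Segmentation and POS tagging happen together; there is no
    # separate whitespace-based tokenization step.
    for word, flag in pseg.cut("我来到北京清华大学"):
        print(word, flag)
    # e.g.: 我 r / 来到 v / 北京 ns / 清华大学 nt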


(‡) I don't know about the Java wrapper, but the Python wrapper had several problems the last time I checked (approx. 1 year ago): it only worked with Python 2, and it used system calls in a fairly complicated way, which was necessary to ensure that the TreeTagger flushed its buffers after each input was processed. The latter has two consequences: processing is slower than when using the TreeTagger directly, and for some languages the full pipeline of command-line tools cannot be used, because the buffer flushing then becomes too complicated.

jogojapan answered Oct 03 '22 09:10