 

Simple Natural Language Processing Startup for Java [duplicate]

Tags:

java

nlp

I want to start developing an NLP project, but I don't know much about the tools available. After googling for about a month, I came to the conclusion that OpenNLP could be my solution.

Unfortunately, I can't find a complete tutorial on using the API; all the ones I've seen skip some general steps. I need a tutorial that starts from the ground up. I have seen a lot of downloads on the site but don't know how to use them. Do I need to train something? Here is what I want to know:

How do I install / set up an NLP system that can:

  1. parse an English sentence into words
  2. identify the different parts of speech
asked Apr 29 '11 13:04 by shababhsiddique




2 Answers

You say that you need to 'parse' each sentence. You probably already know this, but just to be explicit, in NLP, the term 'parse' usually means to recover some hierarchical syntactic structure. The most common types are constituent structure (e.g., via a context-free grammar) and dependency structure.
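For a concrete sense of the difference, here is roughly what the two representations look like for a short illustrative sentence (Penn Treebank-style brackets for the constituent parse, Stanford-style typed-dependency triples for the dependency parse; the example is mine, not any particular parser's output):

    Constituent: (S (NP (DT The) (NN dog)) (VP (VBD barked)))
    Dependency:  det(dog-2, The-1)   nsubj(barked-3, dog-2)   root(ROOT-0, barked-3)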

If you need hierarchical structure, I'd recommend you consider just starting with a parser. Most parsers I'm aware of include POS tagging during parsing, and may provide higher-accuracy tagging than finite-state POS taggers. (Caveat: I'm much more familiar with constituent parsers than with dependency parsers. It's possible that some or most dependency parsers require POS tags as input.)

The big downside to parsing is the time complexity. Finite-state POS taggers often run at thousands of words per second. Even greedy dependency parsers are considerably slower, and constituent parsers generally run at 1-5 sentences per second. So if you don't need hierarchical structure, you probably want to stick with a finite-state POS tagger for efficiency.

If you do decide you need parse structure, a few recommendations:

I think the Stanford parser suggested by @aab includes both a constituent parser and a dependency parser; a minimal usage sketch follows after these recommendations.

The Berkeley Parser ( http://code.google.com/p/berkeleyparser/ ) is a pretty well-known PCFG constituent parser that achieves state-of-the-art accuracy (equal or superior to the Stanford parser, I believe) and is reasonably efficient (~3-5 sentences per second).

The BUBS Parser ( http://code.google.com/p/bubs-parser/ ) can also run with the high-accuracy Berkeley grammar, and improves efficiency to around 15-20 sentences/second. Full disclosure - I'm one of the primary researchers working on this parser.

Warning: both of these parsers are research code, with all the problems that engenders. But I'd love to see people actually using BUBS, so if it's of use to you, give it a try and contact me with problems, comments, suggestions, etc.
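As a concrete illustration of the Stanford parser mentioned above, here is a minimal Java sketch adapted from the ParserDemo program that ships with the parser download. The model path and class/method names are the ones used by recent releases and may differ in older versions, so treat it as a sketch rather than a definitive recipe:

    import java.util.Arrays;
    import java.util.List;

    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.ling.Word;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    import edu.stanford.nlp.trees.GrammaticalStructure;
    import edu.stanford.nlp.trees.GrammaticalStructureFactory;
    import edu.stanford.nlp.trees.PennTreebankLanguagePack;
    import edu.stanford.nlp.trees.Tree;
    import edu.stanford.nlp.trees.TreebankLanguagePack;

    public class ParserSketch {
        public static void main(String[] args) {
            // Pre-trained English PCFG grammar from the models jar in the parser download.
            LexicalizedParser lp = LexicalizedParser.loadModel(
                    "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

            // The parser takes an already-tokenized sentence.
            List<? extends HasWord> sentence = Arrays.asList(
                    new Word("The"), new Word("dog"), new Word("barked"), new Word("."));

            // Constituent (phrase-structure) parse; POS tags appear on the leaves.
            Tree tree = lp.apply(sentence);
            tree.pennPrint();

            // Derive typed dependencies from the constituent tree.
            TreebankLanguagePack tlp = new PennTreebankLanguagePack();
            GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
            GrammaticalStructure gs = gsf.newGrammaticalStructure(tree);
            System.out.println(gs.typedDependencies());
        }
    }

This covers both of the original requirements: the bracketed tree shows the phrase structure with a POS tag on every word, and the last line prints the typed dependencies.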

And a couple of Wikipedia references for background, if needed:

  • Context-free grammars: http://en.wikipedia.org/wiki/Stochastic_context-free_grammar

  • Dependency grammars: http://en.wikipedia.org/wiki/Dependency_grammar

answered Oct 16 '22 16:10 by AaronD


Generally you'd do these two tasks in the other order:

  1. Do part-of-speech tagging
  2. Run a parser using the POS tags as input

OpenNLP's documentation isn't that thorough, and some of it has gotten hard to find since the switch to Apache. Some (potentially slightly out-of-date) tutorials are available in the old SourceForge wiki.
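If you stick with OpenNLP, the tokenize-then-tag step is only a few lines. The sketch below assumes the Apache-era (1.5-style) API and the standard pre-trained English models, en-token.bin and en-pos-maxent.bin, downloaded separately from the OpenNLP models page:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class OpenNlpTagSketch {
        public static void main(String[] args) throws Exception {
            try (InputStream tokStream = new FileInputStream("en-token.bin");
                 InputStream posStream = new FileInputStream("en-pos-maxent.bin")) {

                // Step 1: split the raw sentence into tokens.
                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokStream));
                String[] tokens = tokenizer.tokenize("The quick brown fox jumps over the lazy dog.");

                // Step 2: tag each token with its part of speech (Penn Treebank tags).
                POSTaggerME tagger = new POSTaggerME(new POSModel(posStream));
                String[] tags = tagger.tag(tokens);

                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "/" + tags[i]);
                }
            }
        }
    }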

You might want to take a look at the Stanford NLP tools, in particular the Stanford POS Tagger and the Stanford Parser. Both downloads include pre-trained model files, demo files in the top-level directory that show how to get started with the API, and short shell scripts that show how to use the tools from the command line.
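For example, tagging a sentence with the Stanford POS Tagger looks roughly like this; the model file name is the one bundled with recent tagger downloads and may differ in your copy:

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class StanfordTaggerSketch {
        public static void main(String[] args) throws Exception {
            // Path to a pre-trained model inside the unpacked tagger download.
            MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");

            // Returns the sentence with a tag appended to each word, e.g. "fox_NN".
            String tagged = tagger.tagString("The quick brown fox jumps over the lazy dog.");
            System.out.println(tagged);
        }
    }

The tagged tokens can then be fed to a parser if you need the full structure.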

LingPipe might be another good toolkit to check out. A quick search here will lead you to a number of similar questions with links to other alternatives, too!

answered Oct 16 '22 17:10 by aab