Are there APIs for text analysis/mining in Java? [closed]

I want to know if there is an API for doing text analysis in Java: something that can extract all the words in a text, separate words, expressions, etc., and that can tell whether a word it finds is a number, date, year, name, currency, etc.

I'm just starting with text analysis, so I only need an API to kick off. I made a web crawler, and now I need something to analyze the downloaded data. I need methods to count the number of words on a page, find similar words, detect data types, and other features related to the text.

Are there APIs for text analysis in Java?

EDIT: Text mining, I want to mine the text. I'm looking for a Java API that provides this.

asked Jul 23 '11 by Renato Dinhani


3 Answers

It sounds like you're looking for a Named Entity Recogniser.

You have a couple of choices.

CRFClassifier, from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.

GATE (General Architecture for Text Engineering) is an open-source suite for language processing. Take a look at the screenshots on the page for developers: http://gate.ac.uk/family/developer.html. They should give you a brief idea of what it can do, and the video tutorial gives you a better overview of what this software has to offer.

You may need to customise one of them to fit your needs.

You also have other options:

  • simple text extraction via Web services: e.g. Tagthe.net and Yahoo's Term Extractor.
  • part-of-speech (POS) tagging: extracting parts of speech (e.g. verbs, nouns) from the text. Here is a post on SO: What is a good Java library for Parts-Of-Speech tagging? (a minimal tagger sketch follows this list).
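If you go the POS-tagging route with the Stanford tagger, programmatic use is only a few lines. This is a minimal sketch, assuming you have the Stanford POS tagger jar and one of its bundled model files; the model path below is just a placeholder:

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class PosDemo {
    public static void main(String[] args) {
        // Placeholder path: point this at a model shipped with the Stanford POS tagger download.
        MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");

        // tagString appends a POS tag to each token, e.g. "crawler_NN downloaded_VBD".
        String tagged = tagger.tagString("The crawler downloaded 42 pages in July 2011.");
        System.out.println(tagged);
    }
}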

In terms of training for CRFClassifier, you can find a brief explanation in their FAQ:

...the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and has the NER class, and existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions...
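To make that concrete, here is a rough sketch of what the training data and properties file can look like. It is illustrative only: the file names and the feature selection are placeholders, and NERFeatureFactory's Javadoc documents the full list of properties. Each training line is a token, a tab, and its answer class:

Jane	PERSON
lives	O
in	O
London	LOCATION
.	O

A minimal properties file then ties together the data file, the column map, and the features to generate:

trainFile = jane.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
wordShape = chris2useLC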

You can also find a code snippet in the Javadoc of CRFClassifier:

Typical command-line usage

For running a trained model with a provided serialized classifier on a text file:

java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or runtime):

java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile

To train and test a simple NER model from the command line:

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
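You can also call the classifier from Java code rather than the command line. A minimal sketch, assuming you have downloaded one of the serialized Stanford models (the path is a placeholder):

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NerDemo {
    public static void main(String[] args) {
        // Placeholder path: any serialized classifier shipped with Stanford NER will do.
        CRFClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifierNoExceptions("classifiers/english.all.3class.distsim.crf.ser.gz");

        // Wraps each recognized entity in inline XML, e.g. <PERSON>Renato</PERSON>.
        String text = "Renato spent 100 dollars in London on 23 July 2011.";
        System.out.println(classifier.classifyWithInlineXML(text));
    }
}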

answered Sep 20 '22 by William Niu

For example, you might use some classes from the standard java.text library, or use StreamTokenizer (you can customize it to your requirements). But as you know, text data from internet sources usually contains many orthographical mistakes, so for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities have too limited capabilities in that context.
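For reference, this is roughly what the StreamTokenizer approach looks like; it already distinguishes words from numbers out of the box (a minimal sketch):

import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class TokenizeDemo {
    public static void main(String[] args) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader("Price: 99.50 dollars in 2011"));
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_NUMBER) {
                System.out.println("NUMBER: " + st.nval);   // parsed as a double
            } else if (st.ttype == StreamTokenizer.TT_WORD) {
                System.out.println("WORD:   " + st.sval);
            } else {
                System.out.println("OTHER:  " + (char) st.ttype);  // punctuation etc.
            }
        }
    }
}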

So, I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
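A home-grown regex tokenizer could start out like this sketch; the patterns are placeholders you would extend for currencies, names, expressions, and whatever else you need:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
    // Placeholder patterns: one named alternative per token type, tried left to right.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<date>\\d{1,2}/\\d{1,2}/\\d{4})|(?<number>\\d+(\\.\\d+)?)|(?<word>\\p{L}+)");

    public static void main(String[] args) {
        Matcher m = TOKEN.matcher("Renato paid 99.50 on 23/07/2011 in London");
        while (m.find()) {
            if (m.group("date") != null) {
                System.out.println("DATE:   " + m.group());
            } else if (m.group("number") != null) {
                System.out.println("NUMBER: " + m.group());
            } else {
                System.out.println("WORD:   " + m.group());
            }
        }
    }
}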

P.S. Depending on your needs, you might create a state-machine parser for recognizing templated parts in raw text. You can see a simple state-machine recognizer in the picture below (you can construct a more advanced parser that recognizes much more complex templates).

[Figure: a simple state-machine recognizer for templated parts of text]
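As an illustration of that idea (not the recognizer from the picture, just a toy sketch), here is a small state machine that recognizes one template, dates of the form DD/MM/YYYY, character by character:

public class DateStateMachine {
    private enum State { START, DAY, SLASH1, MONTH, SLASH2, YEAR }

    public static boolean isDate(String token) {
        State state = State.START;
        int digits = 0;  // digits consumed in the current field
        for (char c : token.toCharArray()) {
            switch (state) {
                case START:
                case DAY:
                    if (Character.isDigit(c) && digits < 2) { digits++; state = State.DAY; }
                    else if (c == '/' && digits > 0)        { digits = 0; state = State.SLASH1; }
                    else return false;
                    break;
                case SLASH1:
                case MONTH:
                    if (Character.isDigit(c) && digits < 2) { digits++; state = State.MONTH; }
                    else if (c == '/' && digits > 0)        { digits = 0; state = State.SLASH2; }
                    else return false;
                    break;
                case SLASH2:
                case YEAR:
                    if (Character.isDigit(c) && digits < 4) { digits++; state = State.YEAR; }
                    else return false;
                    break;
            }
        }
        return state == State.YEAR && digits == 4;  // must end with a 4-digit year
    }

    public static void main(String[] args) {
        System.out.println(isDate("23/07/2011")); // true
        System.out.println(isDate("2011"));       // false
    }
}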

answered Sep 21 '22 by stemm

If you're dealing with large amounts of data, Apache Lucene may help with what you need.
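For the Lucene route, tokenizing text with one of its analyzers is only a few lines. A minimal sketch, assuming a reasonably recent Lucene version (older releases need a Version argument for StandardAnalyzer):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneTokens {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("body",
                new StringReader("Renato paid 99.50 dollars on 20110723 in London."))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());  // one lower-cased token per line
            }
            ts.end();
        }
        analyzer.close();
    }
}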

Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you can control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or a number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
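And if you roll your own analyzer, that kind of ambiguity can be handled with explicit rules. A sketch, where the patterns are just placeholder rules you can tighten or loosen as needed:

import java.util.regex.Pattern;

public class TokenClassifier {
    // Placeholder rules for deciding what a token is.
    private static final Pattern YYYYMMDD = Pattern.compile(
            "(19|20)\\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\\d|3[01])");
    private static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");
    private static final Pattern WORD   = Pattern.compile("\\p{L}+");

    public static String classify(String token) {
        if (YYYYMMDD.matcher(token).matches()) return "DATE";   // 20110723 -> DATE
        if (NUMBER.matcher(token).matches())   return "NUMBER"; // 636      -> NUMBER
        if (WORD.matcher(token).matches())     return "WORD";
        return "OTHER";
    }

    public static void main(String[] args) {
        System.out.println(classify("20110723"));  // DATE (the date rule is checked first)
        System.out.println(classify("636"));       // NUMBER
        System.out.println(classify("crawler"));   // WORD
    }
}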

answered Sep 19 '22 by scott