Are there APIs for text analysis/mining in Java? [closed]

I want to know if there is an API for doing text analysis in Java: something that can extract all the words in a text, separate words, expressions, etc., and that can tell whether a word it finds is a number, date, year, name, currency, etc.

I'm just starting with text analysis, so I only need an API to kick off. I made a web crawler, and now I need something to analyze the downloaded data. I need methods to count the number of words on a page, find similar words, detect data types, and other features related to the text.

Are there APIs for text analysis in Java?

EDIT: Text mining, I want to mine the text. I'm looking for a Java API that provides this.

asked Jul 23 '11 by Renato Dinhani


3 Answers

It sounds like you're looking for a Named Entity Recogniser.

You have a couple of choices.

CRFClassifier, from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.

GATE (General Architecture for Text Engineering) is an open-source suite for language processing. Take a look at the screenshots on the page for developers: http://gate.ac.uk/family/developer.html. They should give you a brief idea of what it can do, and the video tutorial gives you a better overview of what this software has to offer.

You may need to customise one of them to fit your needs.

You also have other options:

  • simple text extraction via Web services: e.g. Tagthe.net and Yahoo's Term Extractor.
  • part-of-speech (POS) tagging: extracting parts of speech (e.g. verbs, nouns) from the text. Here is a post on SO: What is a good Java library for Parts-Of-Speech tagging? (a minimal tagger sketch follows this list).
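If you go the POS-tagging route with the Stanford tagger, programmatic use is only a few lines. This is a minimal sketch, assuming you have the Stanford POS tagger jar and one of its bundled model files; the model path below is just a placeholder:

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class PosDemo {
    public static void main(String[] args) {
        // Placeholder path: point this at a model shipped with the Stanford POS tagger download.
        MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");

        // tagString appends a POS tag to each token, e.g. "crawler_NN downloaded_VBD".
        String tagged = tagger.tagString("The crawler downloaded 42 pages in July 2011.");
        System.out.println(tagged);
    }
}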

In terms of training for CRFClassifier, you can find a brief explanation in their FAQ:

...the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and has the NER class, and existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions...
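To make that concrete, here is a rough sketch of what the training data and properties file can look like. It is illustrative only: the file names and the feature selection are placeholders, and NERFeatureFactory's Javadoc documents the full list of properties. Each training line is a token, a tab, and its answer class:

Jane	PERSON
lives	O
in	O
London	LOCATION
.	O

A minimal properties file then ties together the data file, the column map, and the features to generate:

trainFile = jane.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
wordShape = chris2useLC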

You can also find a code snippet in the Javadoc of CRFClassifier:

Typical command-line usage

For running a trained model with a provided serialized classifier on a text file:

java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or runtime):

java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile

To train and test a simple NER model from the command line:

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
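You can also call the classifier from Java code rather than the command line. A minimal sketch, assuming you have downloaded one of the serialized Stanford models (the path is a placeholder):

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NerDemo {
    public static void main(String[] args) {
        // Placeholder path: any serialized classifier shipped with Stanford NER will do.
        CRFClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifierNoExceptions("classifiers/english.all.3class.distsim.crf.ser.gz");

        // Wraps each recognized entity in inline XML, e.g. <PERSON>Renato</PERSON>.
        String text = "Renato spent 100 dollars in London on 23 July 2011.";
        System.out.println(classifier.classifyWithInlineXML(text));
    }
}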

answered Sep 20 '22 by William Niu

For example, you might use some classes from the standard java.text library, or use StreamTokenizer (you can customize it to your requirements). But as you know, text data from internet sources usually contains many orthographical mistakes, so for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities have too limited capabilities in that context.
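For reference, this is roughly what the StreamTokenizer approach looks like; it already distinguishes words from numbers out of the box (a minimal sketch):

import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class TokenizeDemo {
    public static void main(String[] args) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader("Price: 99.50 dollars in 2011"));
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_NUMBER) {
                System.out.println("NUMBER: " + st.nval);   // parsed as a double
            } else if (st.ttype == StreamTokenizer.TT_WORD) {
                System.out.println("WORD:   " + st.sval);
            } else {
                System.out.println("OTHER:  " + (char) st.ttype);  // punctuation etc.
            }
        }
    }
}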

So, I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
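A home-grown regex tokenizer could start out like this sketch; the patterns are placeholders you would extend for currencies, names, expressions, and whatever else you need:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
    // Placeholder patterns: one named alternative per token type, tried left to right.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<date>\\d{1,2}/\\d{1,2}/\\d{4})|(?<number>\\d+(\\.\\d+)?)|(?<word>\\p{L}+)");

    public static void main(String[] args) {
        Matcher m = TOKEN.matcher("Renato paid 99.50 on 23/07/2011 in London");
        while (m.find()) {
            if (m.group("date") != null) {
                System.out.println("DATE:   " + m.group());
            } else if (m.group("number") != null) {
                System.out.println("NUMBER: " + m.group());
            } else {
                System.out.println("WORD:   " + m.group());
            }
        }
    }
}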

P.S. Depending on your needs, you might create a state-machine parser for recognizing templated parts in raw text. You can see a simple state-machine recognizer in the picture below (you can construct a more advanced parser that recognizes much more complex templates).

[Figure: a simple state-machine recognizer for templated parts of text]
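As an illustration of that idea (not the recognizer from the picture, just a toy sketch), here is a small state machine that recognizes one template, dates of the form DD/MM/YYYY, character by character:

public class DateStateMachine {
    private enum State { START, DAY, SLASH1, MONTH, SLASH2, YEAR }

    public static boolean isDate(String token) {
        State state = State.START;
        int digits = 0;  // digits consumed in the current field
        for (char c : token.toCharArray()) {
            switch (state) {
                case START:
                case DAY:
                    if (Character.isDigit(c) && digits < 2) { digits++; state = State.DAY; }
                    else if (c == '/' && digits > 0)        { digits = 0; state = State.SLASH1; }
                    else return false;
                    break;
                case SLASH1:
                case MONTH:
                    if (Character.isDigit(c) && digits < 2) { digits++; state = State.MONTH; }
                    else if (c == '/' && digits > 0)        { digits = 0; state = State.SLASH2; }
                    else return false;
                    break;
                case SLASH2:
                case YEAR:
                    if (Character.isDigit(c) && digits < 4) { digits++; state = State.YEAR; }
                    else return false;
                    break;
            }
        }
        return state == State.YEAR && digits == 4;  // must end with a 4-digit year
    }

    public static void main(String[] args) {
        System.out.println(isDate("23/07/2011")); // true
        System.out.println(isDate("2011"));       // false
    }
}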

answered Sep 21 '22 by stemm

If you're dealing with large amounts of data, Apache Lucene may help with what you need.
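For the Lucene route, tokenizing text with one of its analyzers is only a few lines. A minimal sketch, assuming a reasonably recent Lucene version (older releases need a Version argument for StandardAnalyzer):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneTokens {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("body",
                new StringReader("Renato paid 99.50 dollars on 20110723 in London."))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());  // one lower-cased token per line
            }
            ts.end();
        }
        analyzer.close();
    }
}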

Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you can control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or a number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
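And if you roll your own analyzer, that kind of ambiguity can be handled with explicit rules. A sketch, where the patterns are just placeholder rules you can tighten or loosen as needed:

import java.util.regex.Pattern;

public class TokenClassifier {
    // Placeholder rules for deciding what a token is.
    private static final Pattern YYYYMMDD = Pattern.compile(
            "(19|20)\\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\\d|3[01])");
    private static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");
    private static final Pattern WORD   = Pattern.compile("\\p{L}+");

    public static String classify(String token) {
        if (YYYYMMDD.matcher(token).matches()) return "DATE";   // 20110723 -> DATE
        if (NUMBER.matcher(token).matches())   return "NUMBER"; // 636      -> NUMBER
        if (WORD.matcher(token).matches())     return "WORD";
        return "OTHER";
    }

    public static void main(String[] args) {
        System.out.println(classify("20110723"));  // DATE (the date rule is checked first)
        System.out.println(classify("636"));       // NUMBER
        System.out.println(classify("crawler"));   // WORD
    }
}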

answered Sep 19 '22 by scott