Java library for keywords extraction from input text [closed]

I'm looking for a Java library to extract keywords from a block of text.

The process should be as follows:

stop word cleaning -> stemming -> searching for keywords based on statistical information about the English language: if a word appears in the text more often than its probability in English would predict, then it's a keyword candidate.

Is there a library that performs this task?

Shay asked Jul 03 '13 11:07




1 Answer

Here is a possible solution using Apache Lucene. I used version 3.6.2 rather than the latest one, since it is the one I know best. Besides /lucene-core-x.x.x.jar, don't forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (in particular the English one, in your case).

Note that this only computes the frequencies of the words in the input text, grouped by their stem. Comparing these frequencies with English language statistics has to be done afterwards (this answer may help, by the way).


The data model

There is one keyword per stem. Different words may share the same stem, hence the set of terms. The keyword frequency is incremented every time a term is added, even if that term has already been seen: the set silently discards the duplicate, but the occurrence still counts.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class Keyword implements Comparable<Keyword> {

        private final String stem;
        private final Set<String> terms = new HashSet<String>();
        private int frequency = 0;

        public Keyword(String stem) {
            this.stem = stem;
        }

        // records one occurrence of a term that reduces to this stem
        public void add(String term) {
            terms.add(term);
            frequency++;
        }

        @Override
        public int compareTo(Keyword o) {
            // descending order
            return Integer.valueOf(o.frequency).compareTo(frequency);
        }

        // two keywords are equal when they share the same stem
        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            } else if (!(obj instanceof Keyword)) {
                return false;
            } else {
                return stem.equals(((Keyword) obj).stem);
            }
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(new Object[] { stem });
        }

        public String getStem() {
            return stem;
        }

        public Set<String> getTerms() {
            return terms;
        }

        public int getFrequency() {
            return frequency;
        }

    }

Utilities

To stem a word:

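    // imports use the Lucene 3.6 package names
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public static String stem(String term) throws IOException {

        TokenStream tokenStream = null;
        try {

            // tokenize
            tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
            // stem
            tokenStream = new PorterStemFilter(tokenStream);

            // add each token in a set, so that duplicates are removed
            Set<String> stems = new HashSet<String>();
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                stems.add(token.toString());
            }

            // if no stem or 2+ stems have been found, return null
            if (stems.size() != 1) {
                return null;
            }
            String stem = stems.iterator().next();
            // if the stem contains anything other than letters, digits or dashes, return null
            if (!stem.matches("[a-zA-Z0-9-]+")) {
                return null;
            }

            return stem;

        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
        }

    }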

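To find an element in a collection, adding it if it is not there yet (used for the list of potential keywords):

    import java.util.Collection;

    public static <T> T find(Collection<T> collection, T example) {
        // return the existing element equal to the example, if any
        for (T element : collection) {
            if (element.equals(example)) {
                return element;
            }
        }
        // otherwise register the example itself and return it
        collection.add(example);
        return example;
    }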
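The `find` helper above scans the whole collection on every call, so each lookup takes time proportional to the number of distinct stems found so far. One possible alternative, sketched here, is a map keyed by stem. This is not part of the original answer: the names `StemIndex`, `Entry` and `getOrCreate` are hypothetical, and `Entry` is only a minimal stand-in for the `Keyword` class so that the sketch stays self-contained.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class StemIndex {

    /** Minimal stand-in for Keyword: the terms seen for a stem, plus a counter. */
    public static class Entry {
        public final Set<String> terms = new HashSet<String>();
        public int frequency = 0;
    }

    // insertion-ordered map: stem -> entry
    private final Map<String, Entry> entries = new LinkedHashMap<String, Entry>();

    /** Returns the entry for the stem, creating it on first use. */
    public Entry getOrCreate(String stem) {
        Entry entry = entries.get(stem);
        if (entry == null) {
            entry = new Entry();
            entries.put(stem, entry);
        }
        return entry;
    }

    /** Records one occurrence of a term under its stem. */
    public void add(String stem, String term) {
        Entry entry = getOrCreate(stem);
        entry.terms.add(term);
        entry.frequency++;
    }

    /** Number of distinct stems recorded so far. */
    public int size() {
        return entries.size();
    }
}
```

With a hash-based map the lookup no longer depends on how many stems have been collected, which may matter for long input texts. The final sort by frequency would still be needed on the map's values.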

Core

Here is the main input method:

    // imports use the Lucene 3.6 package names
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Collections;
    import java.util.LinkedList;
    import java.util.List;
    import org.apache.lucene.analysis.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.ClassicFilter;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public static List<Keyword> guessFromString(String input) throws IOException {

        TokenStream tokenStream = null;
        try {

            // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
            input = input.replaceAll("-+", "-0");
            // replace any punctuation char but apostrophes and dashes by a space
            input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
            // remove the most common english contractions
            input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

            // tokenize input
            tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
            // to lowercase
            tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
            // remove dots from acronyms (and "'s" but already done manually above)
            tokenStream = new ClassicFilter(tokenStream);
            // convert any char to ASCII
            tokenStream = new ASCIIFoldingFilter(tokenStream);
            // remove english stop words
            tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

            List<Keyword> keywords = new LinkedList<Keyword>();
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                String term = token.toString();
                // stem each term
                String stem = stem(term);
                if (stem != null) {
                    // create the keyword or get the existing one if any
                    Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                    // add its corresponding initial token
                    keyword.add(term.replaceAll("-0", "-"));
                }
            }

            // reverse sort by frequency
            Collections.sort(keywords);

            return keywords;

        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
        }

    }

Example

Using the guessFromString method on the introduction of the Java Wikipedia article, here are the 10 most frequent keywords (i.e. stems) that were found:

    java         x12    [java]
    compil       x5     [compiled, compiler, compilers]
    sun          x5     [sun]
    develop      x4     [developed, developers]
    languag      x3     [languages, language]
    implement    x3     [implementation, implementations]
    applic       x3     [application, applications]
    run          x3     [run]
    origin       x3     [originally, original]
    gnu          x3     [gnu]

To see which original words were found for each stem, iterate over the returned list and read each keyword's terms set (displayed between brackets [...] in the example above).


What's next

Compare each stem's frequency ratio (stem frequency / sum of all frequencies) with the corresponding English language statistics, and keep me in the loop if you manage it: I could be quite interested too :)
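The comparison described above could be sketched as follows. This is only a minimal sketch under stated assumptions, not something from the original answer: the class `KeywordCandidates`, the method `candidates` and the `englishProbability` map (stem to relative frequency in general English, which you would have to build yourself from a reference corpus) are all hypothetical. Treating stems that are missing from the reference map as candidates is a design choice, not a rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class KeywordCandidates {

    /**
     * Keeps the stems whose relative frequency in the document exceeds
     * their relative frequency in general English.
     *
     * stemCounts:         stem -> number of occurrences in the document
     * englishProbability: stem -> relative frequency in a reference corpus
     *                     (hypothetical data, to be supplied by the caller)
     */
    public static List<String> candidates(Map<String, Integer> stemCounts,
                                          Map<String, Double> englishProbability) {
        // sum of all stem frequencies in the document
        int total = 0;
        for (int count : stemCounts.values()) {
            total += count;
        }

        List<String> result = new ArrayList<String>();
        if (total == 0) {
            return result;
        }

        for (Map.Entry<String, Integer> entry : stemCounts.entrySet()) {
            double documentRatio = entry.getValue() / (double) total;
            Double reference = englishProbability.get(entry.getKey());
            // assumption: a stem unknown to the reference corpus counts as a candidate
            if (reference == null || documentRatio > reference) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

The `stemCounts` map can be filled from the list returned by guessFromString, using getStem() as the key and getFrequency() as the value. The reference probabilities must be computed on stems produced by the same stemmer, otherwise the keys will not match.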

sp00m answered Oct 02 '22 12:10