Java library for keywords extraction from input text [closed]

I'm looking for a Java library to extract keywords from a block of text.

The process should be as follows:

stop-word removal -> stemming -> searching for keywords based on statistical information about the English language. That is, if a word appears in the text with a higher probability than it does in English in general, then it is a keyword candidate.

Is there a library that performs this task?

699

asked Jul 03 '13 11:07

Shay

1 Answer

Here is a possible solution using Apache Lucene. I didn't use the latest version but 3.6.2, since it is the one I know best. Besides /lucene-core-x.x.x.jar, don't forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (in particular the English one, in your case).

Note that this only computes the frequencies of the words in the input text, grouped by their respective stems. Comparing these frequencies with English-language statistics has to be done afterwards (this answer may help, by the way).

The data model

There is one keyword per stem. Different words may share the same stem, hence the set of terms. The keyword's frequency is incremented every time a term is added, even if that term has already been seen; the set itself automatically discards duplicates.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class Keyword implements Comparable<Keyword> {

      private final String stem;
      private final Set<String> terms = new HashSet<String>();
      private int frequency = 0;

      public Keyword(String stem) {
        this.stem = stem;
      }

      // counts every occurrence; the set keeps each distinct term once
      public void add(String term) {
        terms.add(term);
        frequency++;
      }

      @Override
      public int compareTo(Keyword o) {
        // descending order
        return Integer.compare(o.frequency, frequency);
      }

      @Override
      public boolean equals(Object obj) {
        if (this == obj) {
          return true;
        } else if (!(obj instanceof Keyword)) {
          return false;
        } else {
          return stem.equals(((Keyword) obj).stem);
        }
      }

      @Override
      public int hashCode() {
        return Arrays.hashCode(new Object[] { stem });
      }

      public String getStem() {
        return stem;
      }

      public Set<String> getTerms() {
        return terms;
      }

      public int getFrequency() {
        return frequency;
      }

    }

Utilities

To stem a word:

    public static String stem(String term) throws IOException {

      TokenStream tokenStream = null;
      try {

        // tokenize
        tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
        // stem
        tokenStream = new PorterStemFilter(tokenStream);

        // add each token in a set, so that duplicates are removed
        Set<String> stems = new HashSet<String>();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
          stems.add(token.toString());
        }

        // if no stem or 2+ stems have been found, return null
        if (stems.size() != 1) {
          return null;
        }
        String stem = stems.iterator().next();
        // if the stem has non-alphanumerical chars, return null
        if (!stem.matches("[a-zA-Z0-9-]+")) {
          return null;
        }

        return stem;

      } finally {
        if (tokenStream != null) {
          tokenStream.close();
        }
      }

    }

To search a collection (this will be used for the list of potential keywords):

    public static <T> T find(Collection<T> collection, T example) {
      for (T element : collection) {
        if (element.equals(example)) {
          return element;
        }
      }
      collection.add(example);
      return example;
    }
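
The helper above scans the whole collection on every call, which is fine for short texts but grows with the number of distinct stems. Here is a minimal sketch of an alternative that keys the entries by stem in a map. The StemIndex and Entry names are hypothetical and not part of the code in this answer; they only mirror the frequency and terms bookkeeping of Keyword.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the Keyword bookkeeping, keyed by stem in a map. */
final class StemIndex {

    /** Per-stem entry: occurrence count plus the distinct original words seen. */
    static final class Entry {
        final String stem;
        final Set<String> terms = new TreeSet<String>();
        int frequency = 0;

        Entry(String stem) {
            this.stem = stem;
        }
    }

    // LinkedHashMap keeps first-seen order, so ties sort deterministically
    private final Map<String, Entry> byStem = new LinkedHashMap<String, Entry>();

    /** Records one occurrence of term under stem, creating the entry on first use. */
    void add(String stem, String term) {
        Entry entry = byStem.get(stem);
        if (entry == null) {
            entry = new Entry(stem);
            byStem.put(stem, entry);
        }
        entry.terms.add(term);
        entry.frequency++;
    }

    /** Returns a new list of the entries, sorted by descending frequency. */
    List<Entry> sortedByFrequency() {
        List<Entry> entries = new ArrayList<Entry>(byStem.values());
        Collections.sort(entries, new Comparator<Entry>() {
            @Override
            public int compare(Entry a, Entry b) {
                return Integer.compare(b.frequency, a.frequency);
            }
        });
        return entries;
    }
}
```

The lookup is then a single hash access per token instead of a scan; whether that matters depends on the size of the input.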

Core

Here is the main input method:

    public static List<Keyword> guessFromString(String input) throws IOException {

      TokenStream tokenStream = null;
      try {

        // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
        input = input.replaceAll("-+", "-0");
        // replace any punctuation char but apostrophes and dashes by a space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // replace most common English contractions
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

        // tokenize input
        tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
        // to lowercase
        tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
        // remove dots from acronyms (and "'s" but already done manually above)
        tokenStream = new ClassicFilter(tokenStream);
        // convert any char to ASCII
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        // remove English stop words
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

        List<Keyword> keywords = new LinkedList<Keyword>();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
          String term = token.toString();
          // stem each term
          String stem = stem(term);
          if (stem != null) {
            // create the keyword or get the existing one if any
            Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
            // add its corresponding initial token
            keyword.add(term.replaceAll("-0", "-"));
          }
        }

        // reverse sort by frequency
        Collections.sort(keywords);

        return keywords;

      } finally {
        if (tokenStream != null) {
          tokenStream.close();
        }
      }

    }
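
The three replaceAll calls at the top of guessFromString can be tried on their own, without Lucene. The sketch below replays them on a plain string; PreprocessDemo and its method names are hypothetical. As far as I understand the ClassicTokenizer documentation, it splits words at hyphens unless the token contains a number, which is presumably why inserting a "0" after each dash keeps dashed words in one piece.

```java
/** Hypothetical demo class replaying the pre-processing regexes on a plain string. */
final class PreprocessDemo {

    /** Applies the same three replacements, in the same order, as guessFromString. */
    static String preprocess(String input) {
        // mark every run of dashes as "-0" so the dashed word contains a digit
        input = input.replaceAll("-+", "-0");
        // replace punctuation other than apostrophes and dashes with a space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // drop common contraction suffixes such as 's, 't, 're, 've, 'll
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");
        return input;
    }

    /** Undoes the dash marker on a token or stem, as done before storing keywords. */
    static String restoreDashes(String token) {
        return token.replaceAll("-0", "-");
    }

    public static void main(String[] args) {
        String cleaned = preprocess("It's a non-specific test, isn't it?");
        // brackets make leading and trailing spaces visible
        System.out.println("[" + cleaned + "]");
        System.out.println(restoreDashes("non-0specific"));
    }
}
```

Note that the contraction regex only strips the suffix after the apostrophe, so "isn't" is left as "isn" rather than being expanded to "is not".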

Example

Using the guessFromString method on the introduction of the Java Wikipedia article, here are the 10 most frequent keywords (i.e. stems) that were found:

    java         x12    [java]
    compil       x5     [compiled, compiler, compilers]
    sun          x5     [sun]
    develop      x4     [developed, developers]
    languag      x3     [languages, language]
    implement    x3     [implementation, implementations]
    applic       x3     [application, applications]
    run          x3     [run]
    origin       x3     [originally, original]
    gnu          x3     [gnu]

To see which original words were found for each stem, iterate over the returned list and read each keyword's terms set (displayed between brackets [...] in the example above).
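
A minimal sketch of such an iteration, assuming a column layout similar to the listing above. The KeywordReport class and the chosen column widths are hypothetical; the formatter takes the three values directly so that it compiles without the rest of the code.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper that prints one keyword per line: stem, count, original words. */
final class KeywordReport {

    /** Pads the stem to 13 columns and the "x" + count to 7 columns, then appends the terms. */
    static String formatLine(String stem, int frequency, Set<String> terms) {
        return String.format(Locale.ROOT, "%-13sx%-6d%s", stem, frequency, terms);
    }

    public static void main(String[] args) {
        // sample values taken from the listing above; TreeSet gives a stable term order
        Set<String> terms = new TreeSet<String>();
        terms.add("compiled");
        terms.add("compiler");
        terms.add("compilers");
        System.out.println(formatLine("compil", 5, terms));
    }
}
```

With the list returned by guessFromString, the loop body would be something like System.out.println(KeywordReport.formatLine(k.getStem(), k.getFrequency(), k.getTerms())) for each Keyword k.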

What's next

Compare the ratio of each stem's frequency to the sum of all frequencies with the corresponding English-language statistics, and keep me in the loop if you manage it: I would be quite interested too :)
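
One possible way to do that comparison, as a rough sketch only: divide each stem's share of the text by its relative frequency in a reference table and keep the stems whose ratio exceeds 1. Everything here is an assumption rather than something this answer provides: the KeywordScorer class, the referenceFrequency map (which you would have to build from an English corpus, stemmed with the same stemmer), and the unseenFrequency fallback used for stems missing from that table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical scorer: ratio of in-text relative frequency to a reference frequency. */
final class KeywordScorer {

    /** A stem together with its observed/reference frequency ratio. */
    static final class Score {
        final String stem;
        final double ratio;

        Score(String stem, double ratio) {
            this.stem = stem;
            this.ratio = ratio;
        }
    }

    /**
     * @param counts             stem -> number of occurrences in the input text
     * @param referenceFrequency stem -> relative frequency (0..1) in a reference corpus
     * @param unseenFrequency    fallback used when a stem is missing from the reference
     * @return stems whose ratio is above 1, sorted by descending ratio
     */
    static List<Score> score(Map<String, Integer> counts,
                             Map<String, Double> referenceFrequency,
                             double unseenFrequency) {
        if (unseenFrequency <= 0) {
            throw new IllegalArgumentException("unseenFrequency must be positive");
        }
        long total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        List<Score> scores = new ArrayList<Score>();
        if (total == 0) {
            return scores;
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            double observed = entry.getValue() / (double) total;
            Double expected = referenceFrequency.get(entry.getKey());
            double baseline = (expected == null || expected <= 0) ? unseenFrequency : expected;
            double ratio = observed / baseline;
            // keep only stems that are more frequent here than in the reference
            if (ratio > 1.0) {
                scores.add(new Score(entry.getKey(), ratio));
            }
        }
        Collections.sort(scores, new Comparator<Score>() {
            @Override
            public int compare(Score a, Score b) {
                return Double.compare(b.ratio, a.ratio);
            }
        });
        return scores;
    }
}
```

The counts map can be filled from the Keyword list (getStem() as key, getFrequency() as value). The threshold of 1 and the handling of unseen stems are arbitrary choices; a smoothed or log-based score may behave better on short texts.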

118

answered Oct 02 '22 12:10

sp00m


Tags: java, keyword, extract, nlp, stemming