I'm looking for a Java library to extract keywords from a block of text.
The process should be as follows:
stop-word removal -> stemming -> selection of keywords based on English-language statistics: if a word appears more frequently in the text than its expected probability in general English, then it is a keyword candidate.
Is there a library that performs this task?
Rapid Automatic Keyword Extraction (RAKE) is a domain-independent keyword extraction algorithm in natural language processing. It is a document-oriented method: it extracts keywords from an individual document rather than from a whole corpus.
A keyword extraction technique can sift through a whole data set in minutes and pick out the words and phrases that best describe each subject. This makes it easy to identify which parts of the available data cover the subjects you are looking for, saving your team many hours of manual processing.
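For reference, RAKE's core idea can be sketched in plain Java with no library at all: split the text into candidate phrases at stop words and punctuation, score each word by degree/frequency, and score each phrase by the sum of its word scores. This is a minimal illustration with a deliberately tiny stop-word list, not the reference implementation:

```java
import java.util.*;

public class RakeSketch {

    // tiny illustrative stop-word list; a real one is much larger
    static final Set<String> STOP = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "is", "in", "to", "for", "on", "with", "from"));

    // split the text into candidate phrases at stop words and punctuation
    static List<List<String>> candidatePhrases(String text) {
        List<List<String>> phrases = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP.contains(token)) {
                if (!current.isEmpty()) {
                    phrases.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(token);
            }
        }
        if (!current.isEmpty()) {
            phrases.add(current);
        }
        return phrases;
    }

    // word score = degree(word) / frequency(word); phrase score = sum of its word scores
    static Map<String, Double> scorePhrases(List<List<String>> phrases) {
        Map<String, Integer> freq = new HashMap<>();
        Map<String, Integer> degree = new HashMap<>();
        for (List<String> phrase : phrases) {
            for (String w : phrase) {
                freq.merge(w, 1, Integer::sum);
                // degree counts the words co-occurring with w in its phrases (w included)
                degree.merge(w, phrase.size(), Integer::sum);
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (List<String> phrase : phrases) {
            double s = 0;
            for (String w : phrase) {
                s += degree.get(w) / (double) freq.get(w);
            }
            scores.merge(String.join(" ", phrase), s, Math::max);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = scorePhrases(candidatePhrases(
                "Keyword extraction is the task of automatic keyword extraction from a text"));
        scores.entrySet().stream()
              .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
              .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

Multi-word phrases tend to win under this scoring because their word scores add up, which is exactly the behavior that makes RAKE good at extracting phrases rather than lone words.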
Here is a possible solution using Apache Lucene. I didn't use the latest version but 3.6.2, since it is the one I know best. Besides `lucene-core-x.x.x.jar`, don't forget to add `lucene-analyzers-x.x.x.jar` (found under `/contrib/analyzers/common/` in the downloaded archive) to your project: it contains the language-specific analyzers (in particular the English one, in your case).
Note that this will only compute the frequencies of the input text's words, grouped by their respective stems. Comparing these frequencies with English-language statistics must be done afterwards (this answer may help, by the way).
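As a rough sketch of that follow-up step: compare each stem's observed per-million frequency in the text against its frequency in general English, and treat large ratios as keyword signals. All the counts and reference values below are made-up, illustrative numbers; a real implementation would load reference frequencies from a corpus word list:

```java
import java.util.Map;

public class CorpusComparison {

    // how many times more frequent a stem is in the text than in the reference corpus
    static double ratio(int countInText, int textSize, double referencePerMillion) {
        double observedPerMillion = countInText * 1_000_000.0 / textSize;
        return observedPerMillion / referencePerMillion;
    }

    public static void main(String[] args) {
        // hypothetical stem counts produced by the pipeline below, for a 600-token text
        Map<String, Integer> observed = Map.of("java", 12, "compil", 5, "run", 3);
        int textSize = 600;

        // hypothetical per-million frequencies of these stems in general English
        Map<String, Double> referencePerMillion = Map.of("java", 2.0, "compil", 15.0, "run", 400.0);

        for (Map.Entry<String, Integer> e : observed.entrySet()) {
            double r = ratio(e.getValue(), textSize, referencePerMillion.get(e.getKey()));
            // a stem appearing far more often than in general English is a keyword candidate
            System.out.printf("%s: %.1fx more frequent than in general English%n", e.getKey(), r);
        }
    }
}
```

With these made-up numbers, a rare stem like "java" dwarfs a common one like "run" even though both appear several times, which is the whole point of weighting by corpus statistics.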
There is one keyword per stem. Different words may share the same stem, hence the `terms` set. The keyword frequency is incremented every time a term is added, even if that term was added before (the set automatically removes duplicates).
```java
public class Keyword implements Comparable<Keyword> {

    private final String stem;
    private final Set<String> terms = new HashSet<String>();
    private int frequency = 0;

    public Keyword(String stem) {
        this.stem = stem;
    }

    public void add(String term) {
        terms.add(term);
        frequency++;
    }

    @Override
    public int compareTo(Keyword o) {
        // descending order
        return Integer.valueOf(o.frequency).compareTo(frequency);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        } else if (!(obj instanceof Keyword)) {
            return false;
        } else {
            return stem.equals(((Keyword) obj).stem);
        }
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(new Object[] { stem });
    }

    public String getStem() {
        return stem;
    }

    public Set<String> getTerms() {
        return terms;
    }

    public int getFrequency() {
        return frequency;
    }

}
```
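To make the frequency-versus-terms semantics concrete, here is a small self-contained demo. It uses a trimmed copy of the class (fields exposed for brevity), so it runs without Lucene or the full class above:

```java
import java.util.HashSet;
import java.util.Set;

public class KeywordDemo {

    // trimmed copy of the Keyword class, just enough to show add()'s behavior
    static class Keyword {
        final String stem;
        final Set<String> terms = new HashSet<>();
        int frequency = 0;

        Keyword(String stem) {
            this.stem = stem;
        }

        void add(String term) {
            terms.add(term);   // set: duplicates are dropped
            frequency++;       // counter: every occurrence counts
        }
    }

    public static void main(String[] args) {
        Keyword k = new Keyword("compil");
        k.add("compiler");
        k.add("compilers");
        k.add("compiler"); // duplicate term: terms set unchanged, frequency still incremented
        System.out.println(k.frequency + " occurrences of " + k.terms);
    }
}
```

After the three calls, `frequency` is 3 while `terms` holds only the 2 distinct words.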
To stem a word:
```java
public static String stem(String term) throws IOException {
    TokenStream tokenStream = null;
    try {
        // tokenize
        tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
        // stem
        tokenStream = new PorterStemFilter(tokenStream);

        // add each token in a set, so that duplicates are removed
        Set<String> stems = new HashSet<String>();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            stems.add(token.toString());
        }

        // if no stem or 2+ stems have been found, return null
        if (stems.size() != 1) {
            return null;
        }
        String stem = stems.iterator().next();
        // if the stem has non-alphanumerical chars, return null
        if (!stem.matches("[a-zA-Z0-9-]+")) {
            return null;
        }
        return stem;
    } finally {
        if (tokenStream != null) {
            tokenStream.close();
        }
    }
}
```
To search a collection for an element, adding it if absent (this will be used on the list of potential keywords):
```java
public static <T> T find(Collection<T> collection, T example) {
    for (T element : collection) {
        if (element.equals(example)) {
            return element;
        }
    }
    collection.add(example);
    return example;
}
```
Here is the main input method:
```java
public static List<Keyword> guessFromString(String input) throws IOException {
    TokenStream tokenStream = null;
    try {
        // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
        input = input.replaceAll("-+", "-0");
        // replace any punctuation char but apostrophes and dashes by a space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // replace most common english contractions
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

        // tokenize input
        tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
        // to lowercase
        tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
        // remove dots from acronyms (and "'s" but already done manually above)
        tokenStream = new ClassicFilter(tokenStream);
        // convert any char to ASCII
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        // remove english stop words
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

        List<Keyword> keywords = new LinkedList<Keyword>();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            String term = token.toString();
            // stem each term
            String stem = stem(term);
            if (stem != null) {
                // create the keyword or get the existing one if any
                Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                // add its corresponding initial token
                keyword.add(term.replaceAll("-0", "-"));
            }
        }

        // reverse sort by frequency
        Collections.sort(keywords);
        return keywords;
    } finally {
        if (tokenStream != null) {
            tokenStream.close();
        }
    }
}
```
Using the `guessFromString` method on the introduction of the Java Wikipedia article, here are the 10 most frequent keywords (i.e. stems) that were found:
```
java      x12  [java]
compil    x5   [compiled, compiler, compilers]
sun       x5   [sun]
develop   x4   [developed, developers]
languag   x3   [languages, language]
implement x3   [implementation, implementations]
applic    x3   [application, applications]
run       x3   [run]
origin    x3   [originally, original]
gnu       x3   [gnu]
```
Iterate over the output list to know which original words were found for each stem, by getting the `terms` sets (displayed between brackets `[...]` in the example above).
Compare each stem's frequency-to-total ratio with the English-language statistics, and keep me in the loop if you manage it: I could be quite interested too :)