I need advice or directions on how to write an algorithm which will find keywords or keyphrases in a string.
The string contains:
The algorithm has the following requirements:
It would be extra cool if it could process up to 8000 characters in a matter of seconds, so that it could be run in real-time, but I'm already asking enough!
Just looking for advice and directions:
Many thanks.
P.S. The strings will be retrieved from a SQL Server 2008 R2 database, so ideally the language would have support for this, if not then it must be able to read/write to STDOUT, a pipe, a stream or a file etc.
Keywords refer to the important phrases/expressions that are representative of the underlying document. Keywords from a document can accurately describe the document's content and can facilitate fast information processing.
Rapid Automatic Keyword Extraction(RAKE) is a Domain-Independent keyword extraction algorithm in Natural Language Processing. 2. It is an Individual document-oriented dynamic Information retrieval method.
To open the Find pane from the Edit View, press Ctrl+F, or click Home > Find. Find text by typing it in the Search the document for… box.
The logic involved makes it complicated to be programmed in T-SQL. Choose a language like C#. First try to make a simple desktop application. Later, if you find that loading all the records to this application is too slow, you could write a C# stored procedure that is executed on the SQL-Server. Depending on the security policy of the SQL-Server, it will need to have a strong key.
To the algorithm now. A list of excluded words is commonly called a stop word list. If you do some googling for this search term, you might find stop word lists you can start with. Add these stop words to a HashSet<T>
(I'll be using C# here)
// Assuming that each line contains one stop word.
HashSet<string> stopWords =
new HashSet<string>(File.ReadLines("C:\stopwords.txt"), StringComparer.OrdinalIgnoreCase);
Later you can look if a keyword candidate is in the stop word list with
If (!stopWords.Contains(candidate)) {
// We have a keyword
}
HashSets are fast. They have an access time of O(1), meaning that the time required to do a lookup does not depend on the number items it contains.
Looking for the keywords can easily be done with Regex.
string text = ...; // Load text from DB
MatchCollection matches = Regex.Matches(text, "[a-z]([:']?[a-z])*",
RegexOptions.IgnoreCase);
foreach (Match match in matches) {
if (!stopWords.Contains(match.Value)) {
ProcessKeyword(match.Value); // Do whatever you need to do here
}
}
If you find that a-z is too restrictive for letters and need accented letters you can change the regex expression to @"\p{L}([:']?\p{L})*"
. The character class \p{L}
contains all letters and letter modifiers.
The phrases are more complicated. You could try to split the text into phrases first and then apply the keyword search on these phrases instead of searching the keywords in the whole text. This would give you the number of keywords in a phrase at the same time.
Splitting the text into phrases involves searching for sentences ending with "." or "?" or "!" or ":". You should exclude dots and colons that appear within a word.
string[] phrases = Regex.Split(text, @"[\.\?!:](\s|$)");
This searches punctuations followed either by a whitespace or an end of line. But I must agree that this is not perfect. It might erroneously detect abbreviations as sentence end. You will have to make experiments in order to refine the splitting mechanism.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With