I am writing an auto-correct program that uses the levenshtein distance to correct a phrase of no more than 64 charcters based of a specific dictionary containing 8000 words.
The dictionary contains on each line the pair " Word word_frequency". I use DictionarEntry objects to store those pairs. Class Dictionar Entry has two fields : value : stores the word string freq : stores the frequency The dictionary is stored as a LinkedList. I read from stdin the 64 character string. before processing it I remove all the spaces. "Coo lweather" -> "Coolweather" I noticed that insead of calculating the levenshtein distance for every prefix , in the last row of the matrix calculated by the levenshtein dynamic( see wikipedia example ) it returns the distances for all the prefixes.
Function lev returns a vector containing the l.distance from second parameter string to all of the first's prefixes, including itself.
My issue is that I have to respect a few additional rules : min lev. distance -> min number of words -> maximum frequency sum -> minimum lexicographic That would be explained as if the total number of solutions is greater than 1 we take the ones with minimum number of words. If there are still more than one we follow the list of rules.
The dynamic I am applying is something similar to a knapsack dynamic. I don't know how to implement the minimum number of words rule ( the maximum frequency one is very similar)
Here is what I have tried so far input/output examples where this fails: "sore reserved" the answer should be so reserved , what I obtain is actually so re served I have chosen this method because it is more efficient. The time limit is 2 seconds for Java.
update : 7th of April . I have found the solution to my problem, however the cpu time is too large so I need to optimize it. It should be no higher than 2000 ms and it currently is at around 6000 ms. So now my main focus becomes optimizing it.
public static String guess (String input, LinkedList<DictionarEntry> Dictionar){
String curent = new String();
String output = new String();
int costMatrix[][][] = new int [input.length()][8000][input.length()];
int index[] = new int[128];
int prev[]= new int[128];
int d[]=new int [128];
int freq[]= new int[128];
int wcount[]=new int[128];
String values[] = new String[128];
for (int i=0 ; i < 128 ; i++){
d[i]=127;
freq[i]=0;
wcount[i]=1;
values[i]="";
}
d[0]=0;
freq[0]=0;
for (int i = 0 ; i <input.length(); ++i){
curent=input.subSequence(i, input.length()).toString();
long start =System.currentTimeMillis();
for (int j = 0 ; j < Dictionar.size();++j){
costMatrix[i][j]=lev(Dictionar.get(j).value,curent);
for(int k=1;k<costMatrix[i][j].length;++k){
if(d[i]+costMatrix[i][j][k]<d[i+k]){
d[i+k]= d[i]+costMatrix[i][j][k];
values[i+k]=values[i]+Dictionar.get(j).value;
freq[i+k]=freq[i]+Dictionar.get(j).freq;
index[i+k]=j;
prev[i+k]=i;
wcount[i+k]=wcount[i]+1;
}
else if ((d[i]+costMatrix[i][j][k])==d[i+k])
if((wcount[i]+1) <wcount[i+k]){
values[i+k]=values[i]+Dictionar.get(j).value;
freq[i+k]=freq[i]+Dictionar.get(j).freq;
index[i+k]=j;
prev[i+k]=i;
wcount[i+k]=wcount[i]+1;
}
else if ((wcount[i]+1)==wcount[i+k])
if((freq[i]+Dictionar.get(j).freq)>freq[i+k]){
values[i+k]=values[i]+Dictionar.get(j).value;
freq[i+k]=freq[i]+Dictionar.get(j).freq;
index[i+k]=j;
prev[i+k]=i;
wcount[i+k]=wcount[i]+1;
}
else if ((freq[i]+Dictionar.get(j).freq)==freq[i+k]){
if((values[i]+Dictionar.get(j).value).compareTo(values[i+k])>0){
values[i+k]=values[i]+Dictionar.get(j).value;
freq[i+k]=freq[i]+Dictionar.get(j).freq;
index[i+k]=j;
prev[i+k]=i;
wcount[i+k]=wcount[i]+1;
}
}
}
}
long finished =System.currentTimeMillis();
System.out.println((finished-start));
output="";
}
int itr=input.length();
while(itr!=0){
output = Dictionar.get(index[itr]).value + " " + output;
itr=prev[itr];
}
return output;
}
Where should I implement the rules and how ( ideally in a more efficient way than using a matrix)?
In case there are any questions or I have left something unclear please feel free to ask
Auto-correct is a feature available on Android and Apple smartphones, PCs and applications like Outlook. Its goal is to make typing a little easier by automatically changing words, letters or punctuation it thinks you entered incorrectly.
Auto-correct is a type of software program that identifies misspelled words, uses algorithms to identify the words most likely to have been intended, and edits the text accordingly.
In the backdrop of machine learning, autocorrect is purely based on Natural Language Processing (NLP). As the name suggests that it is programmed in order to correct spellings and errors while typing text.
The Autocorrect model is programmed to correct spellings and errors while inputting text and locating the most comparable related words. It is completely based on NLP that compares the words in the vocabulary dictionary and the typed words on the keyboard.
Any reason why you can't use an existing library like Apache Lucene ? It supports fuzzy queries that use Levenshtein distance.
Other than that you might want to consider Suffix Trees to speed up partial string search
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With