Dynamic algorithm for text auto-correct

Tags:

I am writing an auto-correct program that uses the levenshtein distance to correct a phrase of no more than 64 charcters based of a specific dictionary containing 8000 words.

The dictionary contains on each line the pair " Word word_frequency". I use DictionarEntry objects to store those pairs. Class Dictionar Entry has two fields : value : stores the word string freq : stores the frequency The dictionary is stored as a LinkedList. I read from stdin the 64 character string. before processing it I remove all the spaces. "Coo lweather" -> "Coolweather" I noticed that insead of calculating the levenshtein distance for every prefix , in the last row of the matrix calculated by the levenshtein dynamic( see wikipedia example ) it returns the distances for all the prefixes.

Function lev returns a vector containing the l.distance from second parameter string to all of the first's prefixes, including itself.

My issue is that I have to respect a few additional rules : min lev. distance -> min number of words -> maximum frequency sum -> minimum lexicographic That would be explained as if the total number of solutions is greater than 1 we take the ones with minimum number of words. If there are still more than one we follow the list of rules.

The dynamic I am applying is something similar to a knapsack dynamic. I don't know how to implement the minimum number of words rule ( the maximum frequency one is very similar)

Here is what I have tried so far input/output examples where this fails: "sore reserved" the answer should be so reserved , what I obtain is actually so re served I have chosen this method because it is more efficient. The time limit is 2 seconds for Java.

update : 7th of April . I have found the solution to my problem, however the cpu time is too large so I need to optimize it. It should be no higher than 2000 ms and it currently is at around 6000 ms. So now my main focus becomes optimizing it.

 public static String guess (String input, LinkedList<DictionarEntry> Dictionar){
       String curent = new String();
      String output = new String();

      int costMatrix[][][] = new int [input.length()][8000][input.length()];         
     int index[] = new int[128];
     int prev[]= new int[128];
        int d[]=new int  [128];
        int freq[]= new int[128];
        int wcount[]=new int[128];
        String values[] = new String[128];   
        for (int i=0 ; i < 128 ; i++){
                d[i]=127;
                freq[i]=0;
                wcount[i]=1;
                values[i]="";
        }           
     d[0]=0;
     freq[0]=0;

         for (int i = 0 ; i <input.length(); ++i){  

             curent=input.subSequence(i, input.length()).toString();
             long start =System.currentTimeMillis();
              for (int j = 0 ; j < Dictionar.size();++j){

                  costMatrix[i][j]=lev(Dictionar.get(j).value,curent);
                  for(int k=1;k<costMatrix[i][j].length;++k){

                      if(d[i]+costMatrix[i][j][k]<d[i+k]){
                          d[i+k]= d[i]+costMatrix[i][j][k];
                              values[i+k]=values[i]+Dictionar.get(j).value;
                              freq[i+k]=freq[i]+Dictionar.get(j).freq;
                              index[i+k]=j;
                              prev[i+k]=i;
                              wcount[i+k]=wcount[i]+1;
                      }
                       else if ((d[i]+costMatrix[i][j][k])==d[i+k])
                                        if((wcount[i]+1) <wcount[i+k]){
                              values[i+k]=values[i]+Dictionar.get(j).value;
                              freq[i+k]=freq[i]+Dictionar.get(j).freq;
                              index[i+k]=j;
                              prev[i+k]=i;
                              wcount[i+k]=wcount[i]+1;    
                                        }
                                        else if ((wcount[i]+1)==wcount[i+k])
                                         if((freq[i]+Dictionar.get(j).freq)>freq[i+k]){
                                             values[i+k]=values[i]+Dictionar.get(j).value;
                                             freq[i+k]=freq[i]+Dictionar.get(j).freq;
                                             index[i+k]=j;
                                             prev[i+k]=i;
                                             wcount[i+k]=wcount[i]+1;       
                                         }
                                         else if ((freq[i]+Dictionar.get(j).freq)==freq[i+k]){
                                             if((values[i]+Dictionar.get(j).value).compareTo(values[i+k])>0){
                                                 values[i+k]=values[i]+Dictionar.get(j).value;
                                              freq[i+k]=freq[i]+Dictionar.get(j).freq;
                                              index[i+k]=j;
                                              prev[i+k]=i;
                                              wcount[i+k]=wcount[i]+1;  
                                             }
                                         }
                  }     
              }
              long finished =System.currentTimeMillis();
                    System.out.println((finished-start)); 

      output="";

         } 

          int itr=input.length();
                   while(itr!=0){
      output = Dictionar.get(index[itr]).value + " " + output;
      itr=prev[itr]; 
  } 
     return output;
  }

Where should I implement the rules and how ( ideally in a more efficient way than using a matrix)?

In case there are any questions or I have left something unclear please feel free to ask

751

asked Apr 06 '12 11:04

pAndrei

1 Answers

Any reason why you can't use an existing library like Apache Lucene ? It supports fuzzy queries that use Levenshtein distance.

Other than that you might want to consider Suffix Trees to speed up partial string search

answered Oct 12 '22 10:10

Andrejs

Related questions
                            
                                Java content APIs for a large number of files
                            
                                LDAP integration with jBilling
                            
                                Open webcam and set as background (question)
                            
                                Is it possible to impose an upper bound (super X) on a named Generic type?
                            
                                Java old gen keeps growing - need help interpreting Java GC output
                            
                                How can I place limits on my canvas's translation matrix?
                            
                                Tools for native memory leak analysis
                            
                                Spring - ModelAttribute and parameter names
                            
                                Java Support for PMML
                            
                                How to use java.net.URI
                            
                                Java: how to execute a process which is not attached to a Windows console?
                            
                                Java IDL : servertool does not register and hangs
                            
                                How to transfer data to a serial port?
                            
                                How to know whether it is a memory leak or not when calling native code in Java?
                            
                                How to reconnect when the LDAP server is restarted?
                            
                                Java I/O: Ensure a file is not locked by another process before any r/w operation
                            
                                Is there a Java 7 to Java 6 converter?
                            
                                How to get all different sizes of system icon of a file
                            
                                Java Being Blocked By Antivirus Software (Specifically AVG)
                            
                                A tree query language for in-memory trees of objects?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Dynamic algorithm for text auto-correct

Tags:

java

levenshtein-distance

dynamic-programming

autocorrect

pAndrei

People also ask

1 Answers

Andrejs

Recent Activity

Donate For Us