Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

n-gram character based similarity measure

Tags:

java

I have extracted bi-grams from words by using the code:

Scanner a = new Scanner(file1);
PrintWriter pw1= new PrintWriter(file2);  
    while (a.hasNext()) {
       String gram=a.next();
       pw1.println(gram);
       int line;
       line = gram.length(); 
       for(int x=0;x<=line-2;x++){
         pw1.println(gram.substring(x, x+2));        
       }
    }
    pw1.close();
}
catch(FileNotFoundException e) {
  System.err.format("FileNotExist");`
}

For example, the bi-grams of "student" are "st", "tu", "ud", "de", "en", "nt".

But I need to find similarity calculation.

I have to calculate similarity value between these splitted grams.

like image 291
jane Avatar asked Dec 26 '22 04:12

jane


1 Answers

Ummm, you're not explaining your question very well, but here's my shot at it.

First things first, your code is all over the place, even this small of a program, it's hard for anyone to read anything, I would edit it to make it readable but I'm not sure I'm allowed.

Anyways, bigrams = pairs of letters, words, or syllables (according to Google). And you're looking for the similarity value of it?

Well I did a little bit of research on it, and it seems that the algorithm you would need is this.

Formula for finding the similarity value of the bigrams of 2 words


Now, let's start fleshing this out. OP, correct me if I misunderstood your question. You are looking for the similarity between words by breaking them down into bigrams and finding their respective similarity values, yes? If so, let's break down this formula before we start to use it, cuz it will most certainly be the one that you need.

A specific example of the formula in action for the 2 words, FRANCE and FRENCH

Now, we have 2 words, FRANCE, and FRENCH. We want to find the similarity values of them if they were broken down into bigrams.

For FRANCE, we have {FR, RA, AN, NC, CE}
For FRENCH, we have {FR, RE, EN, NC, CH}

France and French are meant to represent s1 and s2 from the first equation. Next, we take matches that they have in both bigrams. What I mean by this is, which bigrams, or pairs of letters, can be found in both words?? In this case, the answer would be, FR and NC.

Since we found 2 pairs, the value on top turns into 4 since the formula states, 2 multiplied by the number of matching bigrams. So we have 4 on the top, and nothing else.

Now, the bottom half is solved by adding up the number of bigrams that can be made out of each of the words you are comparing, namely, 5 for FRANCE and 5 for FRENCH. So the denominator would be 10

So now, what do we have? We have 4/10, or .4. THAT is our similarity value, and that is the value you need to find when making your program.


Let's try another example just to root this into our heads, let's say

s1 = "GOLIATH"
s2 = "GOALIE"

So, using the bigrams, we come up with the array of Strings...

{"GO","OL","LI","IA","AT","TH"} {"GO","OA","AL","LI","IE"}

Now, the number of matches. How many matching bigrams are in these 2 words?? Answer - 2, GO and LI

so, the numerator would have

2 x {2 matches} = 4

Now, the denominator, we have 6 bigrams for Goliath and 5 bigrams for Goalie. Remember, we have to add these 2 values per the original formula, so our denominator would be 11.

So, where does that leave us??

S(GOLIATH, GOALIE) = 4/11 ~ .364 <----- Similarity value


I found the formulas (and basically everything I just now learned) under this link which really made things easy to understand.

http://www.catalysoft.com/articles/StrikeAMatch.html

I will edit this comment since it'll take me a while to come up with a method for your class, but just for a quick response, if you are looking for more help on how to do it, the link is a good place to start.

EDIT****

Ok, just built the method for it, here it is.

public class BiGram
{

/*

here are your imports

import java.util.Scanner;
import java.io.File;
import java.io.PrintWriter;
import java.io.FileNotFoundException;

*/
//you'll have to forgive the lack of order or common sense, I threw it 
//together fast I could cuz it sounded like you were in a rush

   public String[][] bigramizedWords = new String[2][100];

   public String[] words = new String[2];

   public File file1 = new File("file1.txt");
   public File file2 = new File("file2.txt");

   public int tracker = 0;
   public double matches = 0;
   public double denominator = 0; //This will hold the sum of the bigrams of the 2 words

   public double results;

   public Scanner a;
   public PrintWriter pw1;


   public BiGram()
   {

      initialize();
      bigramize();

      results = matches/denominator;

      pw1.println("\n\nThe Bigram Similarity value between " + words[0] + " and " + words[1] + " is " + results  + ".");


      pw1.close();


   }

   public static void main(String[] args)
   {

      BiGram b = new BiGram();


   }

   public void initialize()
   {

      try
      {

         a = new Scanner(file1);
         pw1 = new PrintWriter(file2);

         while (a.hasNext()) 
         {

            //System.out.println("Enter 2 words delimited by a space to calculate their similarity values based off of bigrams.");
            //^^^ I was going to use the above line, but I see you are using File and PrintWriter, 
            //I assume you will have the files yourself with the words to be compared

            String gram  = a.next();

            //pw1.println(gram);    -----you had this originally, we don't need this
            int line = gram.length(); 

            for(int x=0;x<=line-2;x++)
            {

               bigramizedWords[tracker][x] = gram.substring(x, x+2);
               pw1.println(gram.substring(x, x+2) + "");

            }

            pw1.println("");

            words[tracker] = gram;

            tracker++;

         }


      }

      catch(FileNotFoundException e) 
      {
         System.err.format("FileNotExist");
      }
   }

   public void bigramize()
   {

      denominator = (words[0].length() - 1) + (words[1].length() - 1); 
      //^^ Let me explain this, basically, for every word you have, let's say BABY and BALL,
      //the denominator is gonna be the sum of number of bigrams. In this case, BABY has {BA,AB,BY} or 3
      //bigrams, same for BALL, {BA,AL,LL} or 3. And the length of the word BABY is 4 letters, same 
      //with Ball. So basically, just subtract their respective lengths by 1 and add them together, and 
      //you get the number of bigrams combined from both words, or 6


      for(int k = 0; k < bigramizedWords[0].length; k++)
      {

         if(bigramizedWords[0][k] != null)
         {


            for(int i = 0; i < bigramizedWords[1].length; i++)
            {

            ///////////////////////////////////////////

               if(bigramizedWords[1][i] != null)
               {

                  if(bigramizedWords[0][k].equals(bigramizedWords[1][i]))
                  {

                     matches++;

                  }

               }

            }

         }

      }

      matches*=2;




      }

}
like image 75
DreadHeadedDeveloper Avatar answered Jan 08 '23 18:01

DreadHeadedDeveloper