
Is there a way to measure string similarity in Google BigQuery?

I'm wondering if anyone knows of a way to measure string similarity in BigQuery.

It seems like it would be a neat function to have.

My use case: I need to compare the similarity of two URLs, as I want to be fairly sure they refer to the same article.

I can find examples using JavaScript, so maybe a UDF is the way to go, but I've not used UDFs at all (or JavaScript, for that matter :) ).

I'm just wondering if there may be a way using the existing regex functions, or if anyone might be able to get me started with porting the JavaScript example into a UDF.

Any help is much appreciated, thanks.

EDIT: Adding some example code

So if I have a UDF defined as:

// distance function

function levenshteinDistance (row, emit) {

  //if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
  if (typeof row.inputA === 'undefined') {var myresult = 1};
  if (typeof row.inputB === 'undefined') {var myresult = 1};
  //if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};

    var myresult = Math.min(
        levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
        levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
        levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
    ) + 1;

  emit({outputA: myresult})

}

bigquery.defineFunction(
  'levenshteinDistance',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  levenshteinDistance                       // Reference to JavaScript UDF
);

// make a test function to test individual parts

function test(row, emit) {
  if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
  emit({outputA: x});
}

bigquery.defineFunction(
  'test',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  test                       // Reference to JavaScript UDF
);

And I try to test it with a query such as:

SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))

I get error:

Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39 Error Location: User-defined function

It seems like maybe row.inputA is not a string, or for some reason the string functions are not able to work on it. I'm not sure if this is a type issue or something funny about which utilities the UDF is able to use by default.

Again, any help is much appreciated, thanks.
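
A likely explanation for the error above: the recursive calls inside levenshteinDistance pass two plain strings, while the function expects a (row, emit) pair. On the second call, row is therefore a string, row.inputA is undefined, and because the undefined checks only assign myresult without returning, execution still reaches substr. One way around this, as a minimal sketch that assumes the legacy (row, emit) UDF shape from the question, is to keep the string logic in a separate helper (the name levenshtein is hypothetical) and let the UDF entry point only unpack the row:

```javascript
// Hypothetical helper: iterative Levenshtein distance between two strings.
// Only two rows of the distance table are kept, so no recursion is needed.
function levenshtein(a, b) {
  // Assumption: NULL/undefined inputs are treated as empty strings.
  a = a == null ? '' : String(a);
  b = b == null ? '' : String(b);
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
  var prev = [];
  for (var j = 0; j <= b.length; j++) prev[j] = j;

  for (var i = 1; i <= a.length; i++) {
    var curr = [i];
    for (var k = 1; k <= b.length; k++) {
      var cost = a[i - 1] === b[k - 1] ? 0 : 1;
      curr[k] = Math.min(
        prev[k] + 1,        // deletion
        curr[k - 1] + 1,    // insertion
        prev[k - 1] + cost  // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// UDF entry point: unpacks the row and emits a single output column.
function levenshteinDistance(row, emit) {
  emit({outputA: levenshtein(row.inputA, row.inputB)});
}

// Inside BigQuery, the registration from the question would stay as it is:
// bigquery.defineFunction('levenshteinDistance', ['inputA', 'inputB'],
//   [{'name': 'outputA', 'type': 'integer'}], levenshteinDistance);
```

The iterative form also avoids the exponential blow-up of the naive recursive definition, which matters for strings as long as URLs.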

andrewm4894 asked Oct 30 '15 10:10


2 Answers

Ready to use shared UDFs - Levenshtein distance:

SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
 , fhoffa.x.levenshtein('googgle', 'goggles')
 , fhoffa.x.levenshtein('is this the', 'Is This The')

6  2  0

Soundex:

SELECT fhoffa.x.soundex('felipe')
 , fhoffa.x.soundex('googgle')
 , fhoffa.x.soundex('guugle')

F410  G240  G240

Fuzzy choose one:

SELECT fhoffa.x.fuzzy_extract_one('jony' 
  , (SELECT ARRAY_AGG(name) 
   FROM `fh-bigquery.popular_names.gender_probabilities`) 
  #, ['john', 'johnny', 'jonathan', 'jonas']
)

johnny

How-to:

  • https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83

Felipe Hoffa answered Sep 24 '22 04:09


If you're familiar with Python's fuzzywuzzy, you can use the same functions in BigQuery through its JavaScript port, fuzzball, loaded as an external library from GCS.

Steps:

  1. Download the JavaScript version of fuzzywuzzy (fuzzball)
  2. Take the compiled file of the library, dist/fuzzball.umd.min.js, and rename it to something clearer (like fuzzball.js)
  3. Upload it to a Google Cloud Storage bucket
  4. Create a temp function to use the library in your query (set the library path in OPTIONS to the path of the uploaded file)

CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
  return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
  library="gs://my-bucket/fuzzball.js");

with data as (select "my_test_string" as a, "my_other_string" as b)

SELECT  a, b, token_set_ratio(a, b) from data

Colin Le Nost answered Sep 23 '22 04:09