Is there a way to measure string similarity in Google BigQuery?

I'm wondering if anyone knows of a way to measure string similarity in BigQuery.

It seems like it would be a neat function to have.

In my case, I need to compare two URLs for similarity, as I want to be fairly sure they refer to the same article.

I can find examples in JavaScript, so maybe a UDF is the way to go, but I've not used UDFs at all (or JavaScript, for that matter :) ).

I'm just wondering whether there is a way to do this with the existing regex functions, or whether anyone can get me started with porting the JavaScript example into a UDF.

Any help much appreciated, thanks.

EDIT: Adding some example code

So if I have a UDF defined as:

// distance function

function levenshteinDistance (row, emit) {

  //if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
  if (typeof row.inputA === 'undefined') {var myresult = 1};
  if (typeof row.inputB === 'undefined') {var myresult = 1};
  //if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};

    var myresult = Math.min(
        levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
        levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
        levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
    ) + 1;

  emit({outputA: myresult})

}

bigquery.defineFunction(
  'levenshteinDistance',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  levenshteinDistance                       // Reference to JavaScript UDF
);

// make a test function to test individual parts

function test(row, emit) {
  if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
  emit({outputA: x});
}

bigquery.defineFunction(
  'test',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  test                       // Reference to JavaScript UDF
);

And I try to test it with a query such as:

SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))

I get the error:

Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39 Error Location: User-defined function

It seems that row.inputA is perhaps not a string, or that for some reason string functions are not able to work on it. I'm not sure whether this is a type issue or something odd about which utilities the UDF can use by default.

Again, any help much appreciated, thanks.


asked Oct 30 '15 10:10

andrewm4894

2 Answers

Ready to use shared UDFs - Levenshtein distance:

SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
 , fhoffa.x.levenshtein('googgle', 'goggles')
 , fhoffa.x.levenshtein('is this the', 'Is This The')

6  2  0
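
For anyone who would rather write the function by hand, as the question attempts, here is a minimal JavaScript sketch. In the question's code, the recursive calls pass plain strings to a function that expects (row, emit), so row.inputA is undefined on the nested call, which matches the reported TypeError. The sketch separates a pure string helper from the legacy UDF wrapper. The helper name levenshtein and the iterative two-row table are my own choices, not taken from fhoffa.x.levenshtein, whose internals I have not checked. This helper is case-sensitive, so lowercase both inputs first if 'is this the' and 'Is This The' should compare as equal.

```javascript
// Pure helper (hypothetical name): Levenshtein distance between two strings,
// computed iteratively with two rows of the edit-distance table.
function levenshtein(a, b) {
  a = a || '';  // treat null/undefined as the empty string
  b = b || '';
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);  // distance from '' to b[0..j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];  // distance from a[0..i) to ''
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute, or keep on a match
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Legacy-style wrapper with the (row, emit) signature used in the question.
// It never recurses, so row is always the real input row.
function levenshteinDistance(row, emit) {
  emit({outputA: levenshtein(row.inputA, row.inputB)});
}

// Register only when running inside BigQuery, where `bigquery` is defined.
if (typeof bigquery !== 'undefined') {
  bigquery.defineFunction(
    'levenshteinDistance',
    ['inputA', 'inputB'],
    [{'name': 'outputA', 'type': 'integer'}],
    levenshteinDistance
  );
}
```

With this helper, the question's test inputs "abc" and "abd" give a distance of 1.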

Soundex:

SELECT fhoffa.x.soundex('felipe')
 , fhoffa.x.soundex('googgle')
 , fhoffa.x.soundex('guugle')

F410  G240  G240
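
To show what a Soundex code encodes, here is a rough JavaScript sketch of the standard American Soundex rules (keep the first letter, map the remaining consonants to digits, collapse adjacent repeats, pad to four characters). It is an assumption that fhoffa.x.soundex follows these same rules; the function below is my own illustration, not its source.

```javascript
// Sketch of American Soundex (hypothetical helper, not the shared UDF's code).
function soundex(s) {
  const codes = {
    b: 1, f: 1, p: 1, v: 1,
    c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
    d: 3, t: 3,
    l: 4,
    m: 5, n: 5,
    r: 6
  };
  // Keep ASCII letters only; everything else is ignored.
  const letters = (s || '').toLowerCase().replace(/[^a-z]/g, '');
  if (letters.length === 0) return '';

  let out = letters[0].toUpperCase();
  let prev = codes[letters[0]] || 0;  // code of the last "significant" letter
  for (let i = 1; i < letters.length && out.length < 4; i++) {
    const ch = letters[i];
    const code = codes[ch] || 0;      // vowels, h, w and y have no code
    if (code !== 0 && code !== prev) out += code;
    // h and w do not separate equal codes; vowels (code 0) do.
    if (ch !== 'h' && ch !== 'w') prev = code;
  }
  return out.padEnd(4, '0');
}
```

Under these rules 'googgle' and 'guugle' both map to G240, which is why Soundex treats them as sounding alike.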

Fuzzy choose one:

SELECT fhoffa.x.fuzzy_extract_one('jony' 
  , (SELECT ARRAY_AGG(name) 
   FROM `fh-bigquery.popular_names.gender_probabilities`) 
  #, ['john', 'johnny', 'jonathan', 'jonas']
)

johnny
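
The idea of "choose the closest candidate" can be sketched in JavaScript as follows. This is only an illustration under my own assumptions: it scores each candidate by a length-normalised Levenshtein similarity and returns the best one. I have not checked how fhoffa.x.fuzzy_extract_one scores candidates, and the names similarity and extractOne are hypothetical.

```javascript
// Levenshtein distance between two strings (iterative, two table rows).
function levenshtein(a, b) {
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]: 1 means identical (ignoring case), 0 means nothing in common.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1;
  return 1 - levenshtein(a.toLowerCase(), b.toLowerCase()) / longest;
}

// Return the candidate with the highest similarity to the query,
// or null when the candidate list is empty. Ties keep the earliest candidate.
function extractOne(query, choices) {
  let best = null;
  let bestScore = -1;
  for (const choice of choices) {
    const score = similarity(query, choice);
    if (score > bestScore) {
      best = choice;
      bestScore = score;
    }
  }
  return best;
}
```

With this particular scoring, extractOne('jony', ['john', 'johnny', 'jonathan', 'jonas']) returns 'johnny'; a different scorer could rank the candidates differently.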

How-to:

https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83


answered Sep 24 '22 04:09

Felipe Hoffa

If you're familiar with Python's fuzzywuzzy, you can use its functions in BigQuery through fuzzball, its JavaScript version, loaded as an external library from GCS.

Steps:

1. Download the JavaScript version of fuzzywuzzy (fuzzball).
2. Take the compiled file of the library, dist/fuzzball.umd.min.js, and rename it to something clearer (like fuzzball).
3. Upload it to a Google Cloud Storage bucket.
4. Create a temp function to use the library in your query (set the path in OPTIONS to the relevant path).

CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
  return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
  library="gs://my-bucket/fuzzball.js");

with data as (select "my_test_string" as a, "my_other_string" as b)

SELECT  a, b, token_set_ratio(a, b) from data

answered Sep 23 '22 04:09

Colin Le Nost


Tags: javascript, regex, google-bigquery, udf
