Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I compute TF/IDF with SQL (BigQuery)

I'm doing text analysis over reddit comments, and I want to calculate the TF-IDF within BigQuery.

like image 531
Felipe Hoffa Avatar asked Oct 31 '17 05:10

Felipe Hoffa


People also ask

Does Google use TF-IDF?

Google uses TF-IDF to determine which terms are topically relevant (or irrelevant) by analyzing how often a term appears on a page (term frequency — TF) and how often it's expected to appear on an average page, based on a larger set of documents (inverse document frequency — IDF).

What is TF-IDF write its formula?

TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc) in a document amongst a collection of documents (also known as a ...

How is IDF calculated in TF-IDF?

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).


1 Answers

This query works on 5 stages:

  1. Obtain all reddit posts I'm interested in. Normalize words (LOWER, only letters and ', unescape some HTML). Split those words into an array.
  2. Calculate the tf (term frequency) for each word in each doc - count how many times it shows up in each doc, relative to the number of words in said doc.
  3. For each word, calculate the number of docs that contain it.
  4. From (3.), obtain idf (inverse document frequency): "inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient"
  5. Multiply tf*idf to obtain tf-idf.

This query manages to do this on one pass, by passing the obtained values up the chain.

#standardSQL
WITH words_by_post AS (
  SELECT CONCAT(link_id, '/', id) id, REGEXP_EXTRACT_ALL(
    REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&', '&'), r'&[a-z]{2,4};', '*')
      , r'[a-z]{2,20}\'?[a-z]+') words
  , COUNT(*) OVER() docs_n
  FROM `fh-bigquery.reddit_comments.2017_07`  
  WHERE body NOT IN ('[deleted]', '[removed]')
  AND subreddit = 'movies'
  AND score > 100
), words_tf AS (
  SELECT id, word, COUNT(*) / ARRAY_LENGTH(ANY_VALUE(words)) tf, ARRAY_LENGTH(ANY_VALUE(words)) words_in_doc
    , ANY_VALUE(docs_n) docs_n
  FROM words_by_post, UNNEST(words) word
  GROUP BY id, word
  HAVING words_in_doc>30
), docs_idf AS (
  SELECT tf.id, word, tf.tf, ARRAY_LENGTH(tfs) docs_with_word, LOG(docs_n/ARRAY_LENGTH(tfs)) idf
  FROM (
    SELECT word, ARRAY_AGG(STRUCT(tf, id, words_in_doc)) tfs, ANY_VALUE(docs_n) docs_n
    FROM words_tf
    GROUP BY 1
  ), UNNEST(tfs) tf
)    


SELECT *, tf*idf tfidf
FROM docs_idf
WHERE docs_with_word > 1
ORDER BY tfidf DESC
LIMIT 1000

enter image description here

like image 156
Felipe Hoffa Avatar answered Sep 18 '22 12:09

Felipe Hoffa