How can I compute TF/IDF with SQL (BigQuery)

1 Answer

This query works in 5 stages:

1. Obtain all the reddit posts I'm interested in. Normalize words (LOWER, only letters and ', unescape some HTML). Split those words into an array.
2. Calculate the tf (term frequency) for each word in each doc: count how many times it shows up in each doc, relative to the number of words in said doc.
3. For each word, calculate the number of docs that contain it.
4. From (3.), obtain idf (inverse document frequency): "inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient".
5. Multiply tf*idf to obtain tf-idf.
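To sanity-check the arithmetic of stages 2 to 5 outside BigQuery, here is a minimal Python sketch on a tiny made-up corpus. The `docs` data and the `tf_idf` helper are hypothetical and only for illustration; the sketch assumes the words are already normalized and uses the natural logarithm, which is what BigQuery's single-argument LOG computes.

```python
import math
from collections import Counter

# Hypothetical toy corpus: doc id -> list of already-normalized words.
docs = {
    "d1": ["great", "movie", "great", "cast"],
    "d2": ["bad", "movie"],
    "d3": ["great", "plot", "bad", "ending"],
}


def tf_idf(docs):
    """Return {(doc_id, word): tf * idf}, using the natural logarithm."""
    n_docs = len(docs)
    # Stage 3: for each word, the number of docs that contain it.
    docs_with_word = Counter(
        word for words in docs.values() for word in set(words)
    )
    scores = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        for word, count in counts.items():
            tf = count / len(words)                       # stage 2
            idf = math.log(n_docs / docs_with_word[word])  # stage 4
            scores[(doc_id, word)] = tf * idf              # stage 5
    return scores


scores = tf_idf(docs)
for (doc_id, word), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, word, round(score, 4))
```

Words that appear in few docs but often within one doc float to the top, which is the ordering the final ORDER BY tfidf DESC in the query relies on.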

This query manages to do this in one pass, by carrying the values each later stage needs (docs_n, words_in_doc, tf) up the chain of subqueries.

#standardSQL
-- Stage 1: pick the comments, normalize the text, split it into an array of words.
WITH words_by_post AS (
  SELECT CONCAT(link_id, '/', id) id, REGEXP_EXTRACT_ALL(
    REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&amp;', '&'), r'&[a-z]{2,4};', '*')
      , r'[a-z]{2,20}\'?[a-z]+') words
  , COUNT(*) OVER() docs_n  -- total number of docs, repeated on every row
  FROM `fh-bigquery.reddit_comments.2017_07`
  WHERE body NOT IN ('[deleted]', '[removed]')
  AND subreddit = 'movies'
  AND score > 100
-- Stage 2: tf = occurrences of the word in the doc / number of words in the doc.
), words_tf AS (
  SELECT id, word, COUNT(*) / ARRAY_LENGTH(ANY_VALUE(words)) tf, ARRAY_LENGTH(ANY_VALUE(words)) words_in_doc
    , ANY_VALUE(docs_n) docs_n
  FROM words_by_post, UNNEST(words) word
  GROUP BY id, word
  HAVING words_in_doc>30  -- keep only docs with more than 30 words
-- Stages 3 and 4: docs containing each word, then idf = LOG(total docs / docs with word).
), docs_idf AS (
  SELECT tf.id, word, tf.tf, ARRAY_LENGTH(tfs) docs_with_word, LOG(docs_n/ARRAY_LENGTH(tfs)) idf
  FROM (
    -- One row per word; the array keeps each doc's tf so it can be unnested again.
    SELECT word, ARRAY_AGG(STRUCT(tf, id, words_in_doc)) tfs, ANY_VALUE(docs_n) docs_n
    FROM words_tf
    GROUP BY 1
  ), UNNEST(tfs) tf
)

-- Stage 5: multiply tf by idf.
SELECT *, tf*idf tfidf
FROM docs_idf
WHERE docs_with_word > 1
ORDER BY tfidf DESC
LIMIT 1000
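To see what the normalization in the first stage does to a comment body, here is a rough Python equivalent of the LOWER / REGEXP_REPLACE / REGEXP_EXTRACT_ALL chain. The function name `normalize_words` and the sample text are made up for illustration, and Python's `re` module stands in for BigQuery's RE2 engine, so treat it as an approximation rather than an exact port.

```python
import re


def normalize_words(body):
    """Approximate the query's word extraction for a single comment body."""
    text = body.lower()
    # Unescape the ampersand entity first, as the inner REGEXP_REPLACE does.
    text = text.replace("&amp;", "&")
    # Replace any remaining short HTML entity (e.g. &quot;) with '*'.
    text = re.sub(r"&[a-z]{2,4};", "*", text)
    # Letters with an optional apostrophe; the pattern needs at least
    # two letters plus one more, so two-letter words are dropped.
    return re.findall(r"[a-z]{2,20}'?[a-z]+", text)


sample = "Don't miss it &amp; enjoy &quot;Dunkirk&quot;!"
print(normalize_words(sample))
```

Note that short words such as "it" never reach the tf stage, because the extraction pattern only matches runs of three or more letters.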

answered Sep 18 '22 12:09

Felipe Hoffa

Tags: sql, google-bigquery, text-analysis