Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculate TF-IDF using Sql

I have a table in my DB containning a free text field column.

I would like to know the frequency each word appears over all the rows, or maybe even calc a TF-IDF for all words, where my documents are that field's values per row.

Is it possible to calculate this using an Sql Query? if not or there's a simpler way could you please direct me to it?

Many Thanks,

Jon

like image 672
Jon Avatar asked Jul 31 '10 09:07

Jon


People also ask

How do I calculate my TF-IDF?

The TF-IDF of a term is calculated by multiplying TF and IDF scores. Translated into plain English, importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document measured by TF is balanced by rarity between documents measured by IDF.

How is IDF calculated in TF-IDF?

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

How is TF-IDF calculated in Sklearn?

The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False ), where n is the total number of documents in the document set and df(t) is the document frequency of t; the ...

Why do we calculate IDF in TF-IDF?

The reason we need IDF is to help correct for words like “of”, “as”, “the”, etc. since they appear frequently in an English corpus. Thus by taking inverse document frequency, we can minimize the weighting of frequent terms while making infrequent terms have a higher impact.


1 Answers

In SQL Server 2008 depending on your needs you could apply full text indexing to the column then query the sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document table valued functions to get the occurrence count.

Edit: Actually even without creating a persistent full text index you can still leverage the parser

WITH testTable AS
(
SELECT 1 AS Id, N'how now brown cow' AS txt UNION ALL
SELECT 2, N'she sells sea shells upon the sea shore' UNION ALL
SELECT 3, N'red lorry yellow lorry' UNION ALL
SELECT 4, N'the quick brown fox jumped over the lazy dog'
)

SELECT display_term, COUNT(*) As Cnt
FROM testTable
CROSS APPLY sys.dm_fts_parser('"' + REPLACE(txt,'"','""') + '"', 1033, 0,0)
WHERE TXT IS NOT NULL
GROUP BY display_term
HAVING COUNT(*) > 1
ORDER BY Cnt DESC

Returns

display_term                   Cnt
------------------------------ -----------
the                            3
brown                          2
lorry                          2
sea                            2
like image 156
Martin Smith Avatar answered Oct 09 '22 20:10

Martin Smith