I want to use Naive Bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether a mention of an entity name in an article really is that entity, on the basis of whether that article is similar to articles where that entity has been correctly verified.
Say we find the text "General Motors" in an article. We have a set of data that contains articles and the correct entities mentioned within them. So, when we find "General Motors" mentioned in a new article, should it fall into the class of articles in the prior data that contained a known genuine mention of "General Motors", or into the class of articles which did not mention that entity?
(I'm not creating a class for every entity and trying to classify every new article into every possible class. I already have a heuristic method for finding plausible mentions of entity names, and I just want to verify the plausibility of the limited number of entity name mentions per article that the method already detects.)
Given that the number of potential classes and articles is quite large and Naive Bayes is relatively simple, I wanted to do the whole thing in SQL, but I'm having trouble with the scoring query...
Here's what I have so far:
-- Word counts per (word, entity) over the marked articles.
CREATE TABLE `each_entity_word` (
  `word` varchar(20) NOT NULL,
  `entity_id` int(10) unsigned NOT NULL,
  `word_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`word`, `entity_id`)
);

-- Per-entity totals: summed word counts and document count.
CREATE TABLE `each_entity_sum` (
  `entity_id` int(10) unsigned NOT NULL DEFAULT '0',
  `word_count_sum` int(10) unsigned DEFAULT NULL,
  `doc_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`entity_id`)
);

-- Word counts over all marked articles, regardless of entity.
CREATE TABLE `total_entity_word` (
  `word` varchar(20) NOT NULL,
  `word_count` int(10) unsigned NOT NULL,
  PRIMARY KEY (`word`)
);

-- Grand totals over all marked articles (a single-row table).
CREATE TABLE `total_entity_sum` (
  `word_count_sum` bigint(20) unsigned NOT NULL,
  `doc_count` int(10) unsigned NOT NULL,
  `pkey` enum('singleton') NOT NULL DEFAULT 'singleton',
  PRIMARY KEY (`pkey`)
);
Each article in the marked data is split into distinct words. For each article, for each entity known to be mentioned in it, every word is added to each_entity_word (or its word_count is incremented), and doc_count is incremented in each_entity_sum, both keyed on that entity_id. For each article, regardless of the entities it contains, word_count in total_entity_word and doc_count in total_entity_sum are similarly incremented for each word.
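For reference, here is a minimal sketch of that upkeep as MySQL upserts. The variables @word, @entity_id, and @article_word_total are hypothetical placeholders for the values in hand, and it assumes each distinct word is processed once per article:

-- One row per distinct word per article, per mentioned entity.
insert into each_entity_word (word, entity_id, word_count)
values (@word, @entity_id, 1)
on duplicate key update word_count = word_count + 1;

-- Once per (article, entity) pair: bump the per-entity totals.
insert into each_entity_sum (entity_id, word_count_sum, doc_count)
values (@entity_id, @article_word_total, 1)
on duplicate key update
  word_count_sum = word_count_sum + values(word_count_sum),
  doc_count = doc_count + 1;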
From those counts:
- P(word | any document) = word_count in total_entity_word for that word, over doc_count in total_entity_sum
- P(word | document mentions entity x) = word_count in each_entity_word for that word and entity_id x, over doc_count in each_entity_sum for entity_id x
- P(word | document does not mention entity x) = (word_count in total_entity_word minus its word_count in each_entity_word for that word and that entity) over (doc_count in total_entity_sum minus doc_count for that entity in each_entity_sum)
- P(document mentions entity x) = doc_count in each_entity_sum for that entity_id, over doc_count in total_entity_sum
- P(document does not mention entity x) = 1 - (doc_count in each_entity_sum for x's entity_id, over doc_count in total_entity_sum)
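For reference, these are exactly the ingredients Bayes' rule needs. Under the naive conditional-independence assumption, for a candidate entity x and an article with words w_1 ... w_n, they combine into:

P(x | w_1...w_n) = P(x) * prod_i P(w_i|x) / (P(x) * prod_i P(w_i|x) + P(not x) * prod_i P(w_i|not x))

The denominator is just the two-class normalizer, so for ranking candidates you only need the numerator (or its log).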
For a new article that comes in, split it into distinct words and just select where word in ('I', 'want', 'to', 'use'...) against either each_entity_word or total_entity_word. In the db platform I'm working with (MySQL), IN clauses are relatively well optimized.
Also, there is no product() aggregate function in SQL, so you can just do sum(log(x)) or exp(sum(log(x))) to get the equivalent of product(x).
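For example, to take the product of a positive column x over a (hypothetical) table t:

select exp(sum(log(x))) as product_x from t;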
So, given a new article split into distinct words, with those words in a big IN() clause and a candidate entity id to test, how can I get the naive Bayesian probability that the article falls into that entity's class, in SQL?
EDIT:
Try #1:
set @entity_id = 1;

-- Note: in MySQL, "select @var = expr" is a comparison, not an assignment;
-- use SELECT ... INTO (or :=) to assign.
select doc_count into @entity_doc_count
  from each_entity_sum where entity_id = @entity_id;
select doc_count into @total_doc_count
  from total_entity_sum;

select
  exp(
    -- log prior: P(entity)
    log(@entity_doc_count / @total_doc_count) +
    -- log likelihood ratio: sum(log P(word|entity)) minus sum(log P(word|not entity))
    -- (subtracting the log sums, not dividing them, matches the product ratio)
    sum(log((ifnull(ew.word_count, 0) + 1) / @entity_doc_count)) -
    sum(log(((aew.word_count + 1) - ifnull(ew.word_count, 0))
            / (@total_doc_count - @entity_doc_count)))
  ) as likelihood
from total_entity_word aew
left outer join each_entity_word ew
  on ew.word = aew.word and ew.entity_id = @entity_id
where aew.word in ('I', 'want', 'to', 'use'...);
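Note that even after those fixes, this returns P(entity) times a likelihood ratio, not a probability in [0, 1]. If an actual posterior is wanted, one way (a sketch along the same lines, with the same smoothing assumptions, untested) is to compute both class scores and normalize:

select p / (p + q) as posterior
from (
  select
    -- p = P(entity) * prod P(word | entity)
    exp(log(@entity_doc_count / @total_doc_count)
        + sum(log((ifnull(ew.word_count, 0) + 1) / @entity_doc_count))) as p,
    -- q = P(not entity) * prod P(word | not entity)
    exp(log(1 - @entity_doc_count / @total_doc_count)
        + sum(log(((aew.word_count + 1) - ifnull(ew.word_count, 0))
                  / (@total_doc_count - @entity_doc_count)))) as q
  from total_entity_word aew
  left outer join each_entity_word ew
    on ew.word = aew.word and ew.entity_id = @entity_id
  where aew.word in ('I', 'want', 'to', 'use'...)
) scores;

The exp() calls can underflow for long word lists; staying in log space and comparing log(p) with log(q) avoids that.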
Use an R to Postgres (or MySQL, etc.) interface
Alternatively, I'd recommend using an established stats package with a connector to the db. This will make your app a lot more flexible if you want to switch from Naive Bayes to something more sophisticated:
http://rpgsql.sourceforge.net/
bnd.pr> data(airquality)
bnd.pr> db.write.table(airquality, no.clobber = F)
bnd.pr> bind.proxy("airquality")
bnd.pr> summary(airquality)
Table name: airquality
Database: test
Host: localhost
Dimensions: 6 (columns) 153 (rows)
bnd.pr> print(airquality)
Day Month Ozone Solar.R Temp
1 1 5 41 190 67
2 2 5 36 118 72
3 3 5 12 149 74
4 4 5 18 313 62
5 5 5 NA NA 56
6 6 5 28 NA 66
7 7 5 23 299 65
8 8 5 19 99 59
9 9 5 8 19 61
10 10 5 NA 194 69
Continues for 143 more rows and 1 more cols...
bnd.pr> airquality[50:55, ]
Ozone Solar.R Wind Temp Month Day
50 12 120 11.5 73 6 19
51 13 137 10.3 76 6 20
52 NA 150 6.3 77 6 21
53 NA 59 1.7 76 6 22
54 NA 91 4.6 76 6 23
55 NA 250 6.3 76 6 24
bnd.pr> airquality[["Ozone"]]
[1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6
[19] 30 11 1 11 4 32 NA NA NA 23 45 115 37 NA NA NA NA NA
[37] NA 29 NA 71 39 NA NA 23 NA NA 21 37 20 12 13 NA NA NA
[55] NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
[73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50
[91] 64 59 39 9 16 78 35 66 122 89 110 NA NA 44 28 65 NA 22
[109] 59 23 31 44 21 9 NA 45 168 73 NA 76 118 84 85 96 78 73
[127] 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20
You'll then want to install the e1071 package to do Naive Bayes. At the R prompt:
[ramanujan:~/base]$R
R version 2.7.2 (2008-08-25)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
~/.Rprofile loaded.
Welcome at Sun Apr 19 00:45:30 2009
> install.packages("e1071")
> install.packages("mlbench")
> library(e1071)
> ?naiveBayes
> example(naiveBayes)
More info:
http://cran.r-project.org/web/packages/e1071/index.html
Here's a simple version for SQL Server. I run it on the free SQL Server Express edition and it is pretty fast.
http://sqldatamine.blogspot.com/2013/07/classification-using-naive-bayes.html