Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL classification

I have a system that tracks what documents users view. Each document has its ID and a cluster that it belongs to. My system tracks the session ID and the number of views. I would now like to construct an SQL query which would give me two columns - the session ID and the classified cluster. The algorithm for classification is simple:

1. select all sessions
2. for each session S
   I. prepare an accumulator ACC for clusters
   II. select the clusters of viewed documents for this session
   III. for each cluster C accumulate the cluster count ( ACC[C]++ )
   IV. find the maximum in the ACC. That is the cluster that the session was classified to

The table structures are as follows, I'm using MySQL 5.5.16:

Session

+-------+-----------+--------------------+
| ID    | sessionID | classified_cluster |
+-------+-----------+--------------------+

SessionDocument

+-------+-----------+------------+
| ID    | sessionID | documentID |
+-------+-----------+------------+

Cluster

+-------+-------+
| ID    | label |
+-------+-------+

ClusterDocument

+-------+-----------+------------+
| ID    | clusterID | documentID |
+-------+-----------+------------+

So basically, I want to select the clusters for each session, count the occurrence of each cluster for viewed documents and find the maximum occurrence. Then the ID of the cluster that occurred the most, is the result for the session therefore the final result set holds the session ID and the most occurred cluster:

Result

+-----------+-----------------------+
| sessionID | classifiedIntoCluster |
+-----------+-----------------------+

I managed to get the clusters of viewed documents for each session (step 2/II.) with this query:

SELECT SD.session_id, CD.cluster_id 
FROM cluster_document AS CD 
INNER JOIN session_document AS SD 
ON CD.document_id = SD.document_id
WHERE session_id IN (SELECT session_id FROM session) 

I'm having trouble figuring out the rest. Is this even possible with nested SELECT queries? Should I use a cursor, and if yes, could someone show an example with a cursor? Any help will be much appreciated.

EDIT #1: added a C# implementation, MySQL dump and expected result

C# implementation

    private void ClassifyUsers() {
        int nClusters = Database.SelectClusterCount(); //get number of clusters
        DataSet sessions = Database.SelectSessions(); //get all sessions
        foreach (DataRow session in sessions.Tables[0].Rows) { //foreach session
           int[] acc = new int[nClusters]; //prepare an accumulator for each known cluster
           string s_id = session["session_id"].ToString();
           DataSet sessionClusters = Database.SelectSessionClusters(s_id); //get clusters for this session

           foreach (DataRow cluster in sessionClusters.Tables[0].Rows) { //for each cluster
               int c = Convert.ToInt32(cluster["cluster_id"].ToString()) - 1;
               acc[c]++; //accumulate the cluster count
           }

           //find the maximum in the accumulator -> that is the most relevant cluster
           int max = 0;
           for (int j = 0; j < acc.Length; j++) {
               if (acc[j] >= acc[max]) max = j;
           }
           max++;
           Database.UpdateSessionCluster(s_id, max); //update the session with its new assigned cluster
       }
    }

Table structure, test data and expected result

Table structure and test data

Expected result

EDIT #2: added a smaller data set and further algorithm walkthrough

Here is a smaller data set:

SESSION

session id    |  cluster
abc                 0
def                 0
ghi                 0
jkl                 0       
mno                 0

CLUSTER

cluster_id  | label
1               A
2               B
3               C
4               D
5               E

SESSION_DOCUMENT

id      | session_id    |   document_id
1           abc             1
2           def             5
3           jkl             3
4           ghi             4
5           mno             2
6           def             2
7           abc             5
8           ghi             3

CLUSTER_DOCUMENT

id      | cluster_id    |   document_id
1           1                  2
2           1                  3
3           2                  5
4           3                  5
5           3                  1
6           4                  3
7           5                  2
8           5                  4

Algorithm in detail

Step 1: get clusters for documents viewed by the session

session_id  |  cluster_id   | label     | document_id   
abc             3               C           1
abc             2               B           5
abc             3               C           5
-----
def             2               B           5
def             3               C           5   
def             1               A           2
def             5               E           2   
----
ghi             5               E           4   
ghi             1               A           3   
ghi             4               D           3   
----
jkl             1               A           3   
jkl             4               D           3   
----
mno             1               A           2
mno             5               E           2

Step 2: count occurrence of clusters

session_id |    cluster_id  | label |   occurrence
abc             3               C           2   <--- MAX
abc             2               B           1
----
def             2               B           1
def             3               C           1   
def             1               A           1
def             5               E           1   <--- MAX
----
ghi             5               E           1   
ghi             1               A           1   
ghi             4               D           1   <--- MAX
----
jkl             1               A           1   
jkl             4               D           1   <--- MAX
----
mno             1               A           1   
mno             5               E           1   <--- MAX

Step 3 (end result): find maximum occurred cluster for each session (see above) and construct the final result set (session_id, cluster_id):

session_id |    cluster_id  
abc                 3           
def                 5
ghi                 4
jkl                 4
mno                 5

EDIT #3: Accepted answer clarification

Both given answers are correct. They both provide a solution for the problem. I gave Mosty Mostacho the accepted answer because he delivered the solution first and provided another version of the solution with a VIEW. The solution from mankuTimma is of the same quality as Mosty Mostacho's solution. Therefore, we have two equally good solutions, I just picked Mosty Mostacho because he was first.

Thanks to both of them for their contributions. .

like image 971
brozo Avatar asked Feb 16 '12 09:02

brozo


2 Answers

Well I have some doubts about how to choose the an occurrence when there are many equal, but looking at the C# code it seems this selection is non-deterministic.

Now, given the sampla data Step 2 actually results in:

+------------+------------+-------+------------+
| SESSION_ID | CLUSTER_ID | LABEL | OCCURRENCE |
+------------+------------+-------+------------+
| abc        |          3 | C     |          2 |
| def        |          1 | A     |          1 |
| def        |          2 | B     |          1 |
| def        |          3 | C     |          1 |
| def        |          5 | E     |          1 |
| ghi        |          1 | A     |          1 |
| ghi        |          4 | D     |          1 |
| ghi        |          5 | E     |          1 |
| jkl        |          1 | A     |          1 |
| jkl        |          4 | D     |          1 |
| mno        |          1 | A     |          1 |
| mno        |          5 | E     |          1 |
+------------+------------+-------+------------+

So, continuing with this data, I get the session_id and the max(cluster_id) for that session id, resulting in:

+------------+------------+
| SESSION_ID | CLUSTER_ID |
+------------+------------+
| abc        |          3 |
| def        |          5 |
| ghi        |          5 |
| jkl        |          4 |
| mno        |          5 |
+------------+------------+

The max(cluster_id) is just there to perform that non-deterministic selection. This is the query:

select s1.session_id, max(s1.cluster_id) as cluster_id from (
  select sd.session_id, cd.cluster_id, count(*) as Occurrence
  from session_document sd
  join cluster_document cd
  on sd.document_id = cd.document_id
  join cluster c
  on c.cluster_id = cd.cluster_id
  group by sd.session_id, cd.cluster_id, c.label
) as s1
left join (
  select sd.session_id, count(*) as Occurrence
  from session_document sd
  join cluster_document cd
  on sd.document_id = cd.document_id
  join cluster c
  on c.cluster_id = cd.cluster_id
  group by sd.session_id, cd.cluster_id, c.label
) as s2
on s1.session_id = s2.session_id and s1.occurrence < s2.occurrence
where s2.occurrence is null
group by s1.session_id

Maybe adding a view would improve performance (replacement of above query):

create view MaxOccurrences as (
  select sd.session_id, cd.cluster_id, count(*) as Occurrence
  from session_document sd
  join cluster_document cd
  on sd.document_id = cd.document_id
  join cluster c
  on c.cluster_id = cd.cluster_id
  group by sd.session_id, cd.cluster_id, c.label
);

select s1.session_id, max(s1.cluster_id) as cluster_id
from MaxOccurrences as s1
left join MaxOccurrences as s2
on s1.session_id = s2.session_id and s1.occurrence < s2.occurrence
where s2.occurrence is null
group by s1.session_id

Let me know if it works.

like image 68
Mosty Mostacho Avatar answered Sep 22 '22 02:09

Mosty Mostacho


If I understand your problem correctly, for every session you want the cluster with the most number of document views. So, here you go The below query returns the maximum count or the number of occurrences of a particular cluster id for every session id.

SELECT SESSION_ID,MAX(CNT) MAX_CNT
FROM (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT
FROM CLUSTER_DOCUMENT AS CD 
INNER JOIN SESSION_DOCUMENT AS SD 
ON CD.DOCUMENT_ID = SD.DOCUMENT_ID
GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT1
GROUP BY SESSION_ID

Then join the above result with the sub-query(where I am calculating the count) again to get the cluster id of the maximum occurrence. In case there are two cluster id's with same number of occurrences I am using the cluster id with maximum value. I have tested on your data and it works. Also, now this query should work on all databases.

SELECT B.SESSION_ID, MAX(CNT2.CLUSTER_ID) FROM 
(SELECT SESSION_ID,MAX(CNT) MAX_CNT
FROM (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT
FROM CLUSTER_DOCUMENT AS CD 
INNER JOIN SESSION_DOCUMENT AS SD 
ON CD.DOCUMENT_ID = SD.DOCUMENT_ID
GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT1
GROUP BY SESSION_ID) B
JOIN (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT
FROM CLUSTER_DOCUMENT AS CD 
INNER JOIN SESSION_DOCUMENT AS SD 
ON CD.DOCUMENT_ID = SD.DOCUMENT_ID
GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT2
ON B.SESSION_ID = CNT2.SESSION_ID
AND B.MAX_CNT = CNT2.CNT
GROUP BY B.SESSION_ID
like image 43
mankuTimma Avatar answered Sep 21 '22 02:09

mankuTimma