I have a system that tracks what documents users view. Each document has its ID and a cluster that it belongs to. My system tracks the session ID and the number of views. I would now like to construct an SQL query which would give me two columns - the session ID and the classified cluster. The algorithm for classification is simple:
1. select all sessions
2. for each session S
I. prepare an accumulator ACC for clusters
II. select the clusters of viewed documents for this session
III. for each cluster C accumulate the cluster count ( ACC[C]++ )
IV. find the maximum in the ACC. That is the cluster that the session was classified to
The table structures are as follows, I'm using MySQL 5.5.16:
Session
+-------+-----------+--------------------+
| ID | sessionID | classified_cluster |
+-------+-----------+--------------------+
SessionDocument
+-------+-----------+------------+
| ID | sessionID | documentID |
+-------+-----------+------------+
Cluster
+-------+-------+
| ID | label |
+-------+-------+
ClusterDocument
+-------+-----------+------------+
| ID | clusterID | documentID |
+-------+-----------+------------+
So basically, I want to select the clusters for each session, count the occurrence of each cluster for viewed documents and find the maximum occurrence. Then the ID of the cluster that occurred the most, is the result for the session therefore the final result set holds the session ID and the most occurred cluster:
Result
+-----------+-----------------------+
| sessionID | classifiedIntoCluster |
+-----------+-----------------------+
I managed to get the clusters of viewed documents for each session (step 2/II.) with this query:
SELECT SD.session_id, CD.cluster_id
FROM cluster_document AS CD
INNER JOIN session_document AS SD
ON CD.document_id = SD.document_id
WHERE session_id IN (SELECT session_id FROM session)
I'm having trouble figuring out the rest. Is this even possible with nested SELECT queries? Should I use a cursor, and if yes, could someone show an example with a cursor? Any help will be much appreciated.
EDIT #1: added a C# implementation, MySQL dump and expected result
C# implementation
private void ClassifyUsers() {
int nClusters = Database.SelectClusterCount(); //get number of clusters
DataSet sessions = Database.SelectSessions(); //get all sessions
foreach (DataRow session in sessions.Tables[0].Rows) { //foreach session
int[] acc = new int[nClusters]; //prepare an accumulator for each known cluster
string s_id = session["session_id"].ToString();
DataSet sessionClusters = Database.SelectSessionClusters(s_id); //get clusters for this session
foreach (DataRow cluster in sessionClusters.Tables[0].Rows) { //for each cluster
int c = Convert.ToInt32(cluster["cluster_id"].ToString()) - 1;
acc[c]++; //accumulate the cluster count
}
//find the maximum in the accumulator -> that is the most relevant cluster
int max = 0;
for (int j = 0; j < acc.Length; j++) {
if (acc[j] >= acc[max]) max = j;
}
max++;
Database.UpdateSessionCluster(s_id, max); //update the session with its new assigned cluster
}
}
Table structure, test data and expected result
Table structure and test data
Expected result
EDIT #2: added a smaller data set and further algorithm walkthrough
Here is a smaller data set:
SESSION
session id | cluster
abc 0
def 0
ghi 0
jkl 0
mno 0
CLUSTER
cluster_id | label
1 A
2 B
3 C
4 D
5 E
SESSION_DOCUMENT
id | session_id | document_id
1 abc 1
2 def 5
3 jkl 3
4 ghi 4
5 mno 2
6 def 2
7 abc 5
8 ghi 3
CLUSTER_DOCUMENT
id | cluster_id | document_id
1 1 2
2 1 3
3 2 5
4 3 5
5 3 1
6 4 3
7 5 2
8 5 4
Algorithm in detail
Step 1: get clusters for documents viewed by the session
session_id | cluster_id | label | document_id
abc 3 C 1
abc 2 B 5
abc 3 C 5
-----
def 2 B 5
def 3 C 5
def 1 A 2
def 5 E 2
----
ghi 5 E 4
ghi 1 A 3
ghi 4 D 3
----
jkl 1 A 3
jkl 4 D 3
----
mno 1 A 2
mno 5 E 2
Step 2: count occurrence of clusters
session_id | cluster_id | label | occurrence
abc 3 C 2 <--- MAX
abc 2 B 1
----
def 2 B 1
def 3 C 1
def 1 A 1
def 5 E 1 <--- MAX
----
ghi 5 E 1
ghi 1 A 1
ghi 4 D 1 <--- MAX
----
jkl 1 A 1
jkl 4 D 1 <--- MAX
----
mno 1 A 1
mno 5 E 1 <--- MAX
Step 3 (end result): find maximum occurred cluster for each session (see above) and construct the final result set (session_id, cluster_id):
session_id | cluster_id
abc 3
def 5
ghi 4
jkl 4
mno 5
EDIT #3: Accepted answer clarification
Both given answers are correct. They both provide a solution for the problem. I gave Mosty Mostacho the accepted answer because he delivered the solution first and provided another version of the solution with a VIEW
. The solution from mankuTimma is of the same quality as Mosty Mostacho's solution. Therefore, we have two equally good solutions, I just picked Mosty Mostacho because he was first.
Thanks to both of them for their contributions. .
Well I have some doubts about how to choose the an occurrence when there are many equal, but looking at the C# code it seems this selection is non-deterministic.
Now, given the sampla data Step 2 actually results in:
+------------+------------+-------+------------+
| SESSION_ID | CLUSTER_ID | LABEL | OCCURRENCE |
+------------+------------+-------+------------+
| abc | 3 | C | 2 |
| def | 1 | A | 1 |
| def | 2 | B | 1 |
| def | 3 | C | 1 |
| def | 5 | E | 1 |
| ghi | 1 | A | 1 |
| ghi | 4 | D | 1 |
| ghi | 5 | E | 1 |
| jkl | 1 | A | 1 |
| jkl | 4 | D | 1 |
| mno | 1 | A | 1 |
| mno | 5 | E | 1 |
+------------+------------+-------+------------+
So, continuing with this data, I get the session_id and the max(cluster_id) for that session id, resulting in:
+------------+------------+
| SESSION_ID | CLUSTER_ID |
+------------+------------+
| abc | 3 |
| def | 5 |
| ghi | 5 |
| jkl | 4 |
| mno | 5 |
+------------+------------+
The max(cluster_id) is just there to perform that non-deterministic selection. This is the query:
select s1.session_id, max(s1.cluster_id) as cluster_id from (
select sd.session_id, cd.cluster_id, count(*) as Occurrence
from session_document sd
join cluster_document cd
on sd.document_id = cd.document_id
join cluster c
on c.cluster_id = cd.cluster_id
group by sd.session_id, cd.cluster_id, c.label
) as s1
left join (
select sd.session_id, count(*) as Occurrence
from session_document sd
join cluster_document cd
on sd.document_id = cd.document_id
join cluster c
on c.cluster_id = cd.cluster_id
group by sd.session_id, cd.cluster_id, c.label
) as s2
on s1.session_id = s2.session_id and s1.occurrence < s2.occurrence
where s2.occurrence is null
group by s1.session_id
Maybe adding a view would improve performance (replacement of above query):
create view MaxOccurrences as (
select sd.session_id, cd.cluster_id, count(*) as Occurrence
from session_document sd
join cluster_document cd
on sd.document_id = cd.document_id
join cluster c
on c.cluster_id = cd.cluster_id
group by sd.session_id, cd.cluster_id, c.label
);
select s1.session_id, max(s1.cluster_id) as cluster_id
from MaxOccurrences as s1
left join MaxOccurrences as s2
on s1.session_id = s2.session_id and s1.occurrence < s2.occurrence
where s2.occurrence is null
group by s1.session_id
Let me know if it works.
If I understand your problem correctly, for every session you want the cluster with the most number of document views. So, here you go The below query returns the maximum count or the number of occurrences of a particular cluster id for every session id.
SELECT SESSION_ID,MAX(CNT) MAX_CNT
FROM (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT
FROM CLUSTER_DOCUMENT AS CD
INNER JOIN SESSION_DOCUMENT AS SD
ON CD.DOCUMENT_ID = SD.DOCUMENT_ID
GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT1
GROUP BY SESSION_ID
Then join the above result with the sub-query(where I am calculating the count) again to get the cluster id of the maximum occurrence. In case there are two cluster id's with same number of occurrences I am using the cluster id with maximum value. I have tested on your data and it works. Also, now this query should work on all databases.
SELECT B.SESSION_ID, MAX(CNT2.CLUSTER_ID) FROM
(SELECT SESSION_ID,MAX(CNT) MAX_CNT
FROM (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT
FROM CLUSTER_DOCUMENT AS CD
INNER JOIN SESSION_DOCUMENT AS SD
ON CD.DOCUMENT_ID = SD.DOCUMENT_ID
GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT1
GROUP BY SESSION_ID) B
JOIN (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT
FROM CLUSTER_DOCUMENT AS CD
INNER JOIN SESSION_DOCUMENT AS SD
ON CD.DOCUMENT_ID = SD.DOCUMENT_ID
GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT2
ON B.SESSION_ID = CNT2.SESSION_ID
AND B.MAX_CNT = CNT2.CNT
GROUP BY B.SESSION_ID
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With