Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comment threading with score factoring

I am banging my head against something and I was wondering if somebody more skilled that me could help me out.

My aim is to create a comment thread that factors in a system of comment scoring.

First I'll explain where I am currently.

Say we have a comment thread on an article that looks like the example below. The number in parenthasis is the ID of that comment. ID's are assigned automatically by the database and increment chronologically with each additional comment posted. The number of dashes before the comment text reperesent the comment depth.

(01)"This is a top level comment." 
(02)-"This is a second level comment. A reply to the top level comment above."
(06)-"This is also a second level comment / another reply to comment 01."
(07)--"This is a reply to comment 06."
(03)"This is a different top level comment."
(05)-"This is a reply to the comment above."
(08)--"This is a reply to that comment in turn."
(10)---"This is a deeper comment still."
(04)"This is one more top level comment."
(09)-"This is one more reply."

My first problem was storing this data in a way that means it can be returned in the correct order. If you simply store a depth field and order by depth, it'll bring back all of the top level comments first and then the second level comments etc. This isn't right, we must return the comments with the full parentage still intact.

One way to achieve this is to store the full parentage for each comment.

Comment ID  | Parentage
     01     |              (Comment 01 has no parent because it is top level)
     02     | 01-          (Comment 02 was a reply to comment 01)
     03     | 
     04     |              
     05     | 03-
     06     | 01-
     07     | 01-06-       (Comment 07 has two ancestors 01 and then 06)
     08     | 03-05-
     09     | 04-
     10     | 03-05-08-

Adding another comment record is as simple as grabbing the Parentage from the comment that you are replying to, and appending its ID to form the new parentage. For example, if I was replying to comment 10, I would take it's parentage (03-05-08-) and append its ID (10-). The database would automatically recognise it as the 11th comment, and we'd get:

Comment ID  | Parentage
     01     | 
     02     | 01- 
     03     | 
     04     |              
     05     | 03-
     06     | 01-
     07     | 01-06-
     08     | 03-05-
     09     | 04-
     10     | 03-05-08-
     11     | 03-05-08-10-

Now, when we order the comments for display, we order on a concatenation of Parentage and Comment ID which gives us:

Order by CONCAT(Parentage, ID)

Comment ID  | Parentage    |   CONCAT(Parentage, ID)
     01     |              |   01-
     02     | 01-          |   01-02-
     06     | 01-          |   01-06-
     07     | 01-06-       |   01-06-07-
     03     |              |   03-
     05     | 03-          |   03-05-
     08     | 03-05-       |   03-05-08-
     10     | 03-05-08-    |   03-05-08-10-
     11     | 03-05-08-10- |   03-05-08-10-11-
     04     |              |   04-
     09     | 04-          |   04-09-

Which produces the exact same list as first demonstrated. With Comment 11 which we later added inserted in the correct place:

(01)"This is a top level comment." 
(02)-"This is a reply to the top level comment."
(06)-"This is another reply that was posted later than the first."
(07)--"This is a reply to the second level comment directly above."
(03)"This is a different top level comment."
(05)-"This is a reply to the comment above."
(08)--"This is a reply to the comment above."
(10)---"This is a deeper comment still."
(11)----"THIS COMMENT WAS ADDED IN THE EARLIER EXAMPLE."
(04)"This is one more top level comment."
(09)-"This is one more reply."

Indenting can be done by checking the length of the CONCAT string and multiplying the len(CONCAT(Parentage, ID)) by a set number of pixels. This is great, we have a system of storing comments in a way that recognises their parentage.

Now the problem:

Not all comments are equal. A system of comment scoring is needed to distinguish good comments. Let's say each comment has an upvote button.. while we want to retain parentage, if one comment has two direct replies at the same level then we want to show the one with the most upvotes first. I'll add some votes in [square brackets] below.

(01)"This is a top level comment." [6 votes]
(02)-"This is a reply to the top level comment." [2 votes]
(06)-"This is another reply that was posted later than the first." [30 votes]
(07)--"This is a reply to the second level comment directly above." [5 votes]
(03)"This is a different top level comment." [50 votes]
(05)-"This is a reply to the comment above." [4 votes]
(08)--"This is a reply to the comment above." [0 votes]
(10)---"This is a deeper comment still." [0 votes]
(11)----"THIS COMMENT WAS ADDED IN THE EARLIER EXAMPLE." [0 votes]
(04)"This is one more top level comment." [2 votes]
(09)-"This is one more reply." [0 votes]

In this example, comments (01) and (03) are both top-level but (03) has [50 votes] and (01) only has [6 votes]. (01) appears above only by virtue of the fact that it was posted earlier and therefore has been assigned a smaller ID. Likewise (02) and (06) are both replies to (01) but must be reordered to allow the one with the most votes (06) to rise to the top.

I am completely and utterly stuck in trying to achieve this.

I imagine that any ordering/reordering and indexing would be better done on a comment-vote being cast rather than on page load so that the page-load time can be as quick as possible but beyond that I have absolutely no idea!

Any ideas or light you could shed on possible avenues would really take a load off! Thanks for your help as always.

--------------------------------------------------------------------------------

Edit: In response to @Paddy's solution,

When I run the expression offered by @Paddy below on the mock data, the first error I get is:

"The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified." 

This can be remedied by adding SELECT 'top 100 percent' to the recursive member definition. Once this is done, I get the error:

'CommentTree' has more columns than were specified in the column list.

This can be resolved by adding a 'Level' column to the CommentTree specification. This then prints the data, but it returns all the top level comments first and then something resembling (but not actually matching) the correct sort order after.

The data is returned as such:

ParentId  |  CommentId  |  Comment  |  Vote  | Level
NULL      |      1      | Text here |   6    |  0
NULL      |      3      | Text here |   50   |  0     
NULL      |      4      | Text here |   2    |  0    
4         |      9      | Text here |   0    |  1    
3         |      5      | Text here |   4    |  1    
5         |      8      | Text here |   0    |  2    
8         |      10     | Text here |   0    |  3   
10        |      11     | Text here |   0    |  4    
1         |      2      | Text here |   2    |  1    
1         |      6      | Text here |   30   |  1     
6         |      7      | Text here |   5    |  2    

Have I done anything wrong or did @Paddy miss out something perhaps? Please accept my apologies, recursive functions are very new to me.

like image 357
Dom Vinyard Avatar asked Feb 08 '12 16:02

Dom Vinyard


People also ask

What makes a good comment thread?

Before diving into designing and writing code, let’s break down what actually makes a good comment thread. A good comment thread will have the following characteristics: Comments are readable, and all the important data-points are visible to the user at a glance. This includes the author, number of votes, timestamp, and the content.

What to do if there are no comments in a thread?

If a comment thread does not have any comments (empty state), we need to communicate that clearly to the user. A simple paragraph containing the line “There are no comments yet.” along with a form containing a text box and submit button can go a very long way, and should be the bare minimum when dealing with empty states.

How can I reduce spam when building comment threads?

If you give users a form to provide their inputs, you can guarantee that you will find spam coming your way, so this is obviously an issue to address when building comment threads. A great way to reduce spam is to use services like reCAPTCHA from Google.


2 Answers

Code below looks good for your task. It's a bit complex, but it was a challenge for me to make it in a single SELECT. You can split it into several SELECT with prefetch into temporary tables (for performance purposes), or keep it together.

Thank you for the question, it was interesting!

Please note that ParentID for root nodes must be 0, not NULL.

DECLARE @a TABLE (
    CommentID  INT,
    ParentID INT,
    Comment VARCHAR(100),
    Vote INT
)


INSERT @a
VALUES
    (1, 0, '', 6),
    (3, 0, '', 50),
    (4, 0, '', 2),
    (9, 4, '', 0),
    (5, 3, '', 4),
    (8, 5, '', 0),
    (10, 8, '', 0),
    (11, 10, '', 0),
    (2, 1, '', 2),
    (6, 1, '', 30),
    (7, 6, '', 5)

;WITH CTE_1 (ParentId, CommentId, Comment, Vote, Level, LevelPriority, Path)    -- prepare base info
AS
(
    SELECT c.ParentId, c.CommentId, c.Comment, c.Vote, 0 AS Level, ROW_NUMBER() OVER(ORDER BY c.Vote DESC), CAST('/' + CAST(c.CommentId AS VARCHAR(32)) AS VARCHAR(MAX)) + '/'
    FROM @a AS c
    WHERE ParentId = 0

    UNION ALL

    SELECT c.ParentId, c.CommentId, c.Comment, c.Vote, Level + 1 AS Level, ROW_NUMBER() OVER(ORDER BY c.Vote DESC), d.Path + CAST(c.CommentId AS VARCHAR(32)) + '/'
    FROM @a AS c
    INNER JOIN CTE_1 AS d
        ON c.ParentID = d.CommentID
),
CTE_2 (ParentId, CommentId, Comment, Vote, Level, LevelPriority, ChildCount)    -- count number of children
AS
(
    SELECT p.ParentId, p.CommentId, p.Comment, p.Vote, p.Level, p.LevelPriority, COUNT(*)
    FROM CTE_1 AS p
    INNER JOIN CTE_1 AS c
        ON c.Path LIKE p.Path + '%'
    GROUP BY 
        p.ParentId, p.CommentId, p.Comment, p.Vote, p.Level, p.LevelPriority
),
CTE_3 (ParentId, CommentId, Comment, Vote, Level, LevelPriority, OverAllPriority, ChildCount) -- calculate overall priorities
AS
(
    SELECT c.ParentId, c.CommentId, c.Comment, c.Vote, c.Level, c.LevelPriority, 1 AS OverAllPriority, ChildCount
    FROM CTE_2 AS c
    WHERE Level = 0 AND LevelPriority = 1

    UNION ALL

    SELECT c.ParentId, c.CommentId, c.Comment, c.Vote, c.Level, c.LevelPriority, 
        CASE 
            WHEN c.ParentID = d.CommentID THEN d.OverAllPriority + 1
            ELSE d.OverAllPriority + d.ChildCount
        END,
        c.ChildCount
    FROM CTE_2 AS c
    INNER JOIN CTE_3 AS d
        ON 
            (c.ParentID = d.CommentID AND c.LevelPriority = 1) 
            OR (c.ParentID = d.ParentID AND d.LevelPriority + 1 = c.LevelPriority)
)
SELECT ParentId, CommentId, Comment, Vote
FROM CTE_3
ORDER BY OverAllPriority

In this query I do following:

  1. In CTE_1 I calculate ordering positions within same parent comment (based on votes) and build tree path to collect information about all nodes in the hierarchy.
  2. In CTE_2 I calculate number of descendants belonging to every node +1. Tree path allows to count descendants of all level by one SELECT.
  3. In CTE_3 I calculate overall ordering positions based on 3 simple rules:
    1. The topmost row has position = 1
    2. The upper child node has position = parent_position + 1
    3. The next sibling should go after all descendatns of previous one and has position = prev_sibling_position + prev_sibling_number_of_descendants

EDIT The same solution, but without CTE.

DECLARE @a TABLE (
    CommentID  INT,
    ParentID INT,
    Comment VARCHAR(100),
    Vote INT
)

INSERT @a
VALUES
    (1, 0, '', 6),
    (3, 0, '', 50),
    (4, 0, '', 2),
    (9, 4, '', 0),
    (5, 3, '', 4),
    (8, 5, '', 0),
    (10, 8, '', 0),
    (11, 10, '', 0),
    (2, 1, '', 2),
    (6, 1, '', 30),
    (7, 6, '', 5)


DECLARE @rows INT

DECLARE @temp_table TABLE (
    CommentID  INT,
    ParentID INT,
    Comment VARCHAR(100),
    Vote INT,
    LevelPriority INT, 
    Path VARCHAR(MAX),
    ChildCount INT NULL,
    OverAllPriority INT NULL
)

INSERT @temp_table (CommentID, ParentID, Comment, Vote, LevelPriority, Path)
SELECT CommentID, ParentID, Comment, Vote, ROW_NUMBER() OVER(ORDER BY Vote DESC), '/' + CAST(CommentId AS VARCHAR(32)) + '/'
FROM @a
WHERE ParentID = 0

SELECT @rows = @@ROWCOUNT

WHILE @rows > 0
BEGIN

    INSERT @temp_table (CommentID, ParentID, Comment, Vote, LevelPriority, Path)
    SELECT a.CommentID, a.ParentID, a.Comment, a.Vote, ROW_NUMBER() OVER(PARTITION BY a.ParentID ORDER BY a.Vote DESC), c.Path + CAST(a.CommentId AS VARCHAR(32)) + '/'
    FROM @a AS a
    INNER JOIN @temp_table AS c
        ON a.ParentID = c.CommentID
    WHERE NOT
        a.CommentID IN (SELECT CommentID FROM @temp_table)  

    SELECT @rows = @@ROWCOUNT
END

UPDATE c
SET ChildCount = a.cnt
FROM (
    SELECT p.CommentID, COUNT(*) AS cnt 
    FROM @temp_table AS p
    INNER JOIN @temp_table AS c
        ON c.Path LIKE p.Path + '%'
    GROUP BY 
        p.CommentID
) AS a
INNER JOIN @temp_table AS c
    ON a.CommentID = c.CommentID

UPDATE @temp_table
SET OverAllPriority = 1
WHERE ParentID = 0 AND LevelPriority = 1

SELECT @rows = @@ROWCOUNT

WHILE @rows > 0
BEGIN

    UPDATE c
    SET 
        OverAllPriority = CASE 
            WHEN c.ParentID = p.CommentID THEN p.OverAllPriority + 1
            ELSE p.OverAllPriority + p.ChildCount
        END
    FROM @temp_table AS p
    INNER JOIN @temp_table AS c
        ON (c.ParentID = p.CommentID AND c.LevelPriority = 1) 
            OR (p.ParentID = c.ParentID AND p.LevelPriority + 1 = c.LevelPriority)
    WHERE
        c.OverAllPriority IS NULL  
        AND p.OverAllPriority IS NOT NULL

    SELECT @rows = @@ROWCOUNT
END


SELECT * FROM @temp_table 
ORDER BY OverAllPriority
like image 125
Andrey Gurinov Avatar answered Sep 30 '22 17:09

Andrey Gurinov


Although not directly related to your question, my advice would be to change to the Nested Set Model. I know it is a lot of rework but sooner or later you'll realise it is the best choice :)

like image 29
Mosty Mostacho Avatar answered Sep 30 '22 15:09

Mosty Mostacho