I am implementing a web application similar to Twitter. I need to implement a 'retweet' action, and one tweet can be retweeted by one person multiple times.
I have a basic 'tweets' table that has the following columns:
Tweets: tweet_id | tweet_text | tweet_date_created | tweet_user_id
(where tweet_id is the primary key for tweets, tweet_text contains the tweet text, tweet_date_created is the DateTime when the tweet was created, and tweet_user_id is a foreign key to the users table that identifies the user who created the tweet)
Now I am wondering how I should implement the retweet action in my database.
Should I create a new join table, which would look like this:
Retweets: tweet_id | user_id | retweet_date_retweeted
(where tweet_id is a foreign key to the tweets table, user_id is a foreign key to the users table that identifies the user who retweeted the tweet, and retweet_date_retweeted is a DateTime that specifies when the retweet was made)
pros: There will be no empty columns; when a user retweets, a new row is created in the retweets table.
cons: Querying will be more difficult: it needs to join two tables and somehow sort the tweets by two dates (an original tweet by tweet_date_created, a retweet by retweet_date_retweeted).
Or should I implement it in the tweets table with a parent_id column, so that it would look like this:
Tweets: tweet_id | tweet_text | tweet_date_created | tweet_user_id | parent_id
(where all the columns remain the same and parent_id is a foreign key to the same tweets table. When a tweet is created, parent_id stays empty. When a tweet is retweeted, parent_id contains the original tweet's id, tweet_user_id contains the user who retweeted, tweet_date_created contains the DateTime when the retweet was made, and tweet_text stays empty, because we will not let users change the original tweet when retweeting.)
pros: Querying is much more elegant, as I do not have to join two tables.
cons: There will be empty cells every time a tweet is retweeted. So if I have 1 000 tweets in my database and each of them is retweeted 5 times, there will be 5 000 additional rows with an empty tweet_text in my tweets table.
Which is the most efficient way? Is it better to have empty cells or a cleaner querying process?
IMO option #1 would be better. The query joining the tweet and retweet tables would not be complex at all; use a left or an inner join, depending on whether you want to show all tweets or only tweets that were retweeted. The join should also perform well: the retweet table is narrow, the joined columns are ints, and each of them will be indexed because of the FK constraints.
Another recommendation: don't prefix every column with tweet or retweet. That can be inferred from the table in which the data is stored, for example:
tweet
    id
    user_id
    text
    created_at
retweet
    tweet_id
    user_id
    created_at
And sample joins:
# Return every tweet that has been retweeted, with its retweet count
SELECT
    t.id,
    COUNT(*) AS retweet_count
FROM
    tweet AS t
    INNER JOIN retweet AS rt ON rt.tweet_id = t.id
GROUP BY
    t.id;
# Return a specific tweet together with its retweet data, if any
SELECT
    t.id,
    t.text,
    rt.user_id AS retweeted_by,
    rt.created_at AS retweeted_at
FROM
    tweet AS t
    LEFT OUTER JOIN retweet AS rt ON rt.tweet_id = t.id
WHERE
    t.id = :tweetId;
-- Update per request --
The following is for demonstration only, to show why I would opt for option #1. There are no foreign keys and no indices; you will have to add these yourself. But the results should demonstrate that the joins won't be too painful.
CREATE TABLE `tweet` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `user_id` int(10) unsigned NOT NULL,
    `value` varchar(255) NOT NULL,
    `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8;
CREATE TABLE `retweet` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `tweet_id` int(10) unsigned NOT NULL,
    `user_id` int(10) unsigned NOT NULL,
    `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
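Since the tables above leave out foreign keys and indices, here is a minimal sketch of how they might be added. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it can be run anywhere; the users table and the index names are assumptions made for illustration, not part of the original schema, and the column types are SQLite's rather than MySQL's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);  -- hypothetical users table

CREATE TABLE tweet (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(id),
    value      TEXT    NOT NULL,
    created_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE retweet (
    id         INTEGER PRIMARY KEY,
    tweet_id   INTEGER NOT NULL REFERENCES tweet(id),
    user_id    INTEGER NOT NULL REFERENCES users(id),
    created_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Index names are illustrative; they cover the per-user timeline lookups
-- and the join from retweet back to tweet.
CREATE INDEX idx_tweet_user_created   ON tweet   (user_id, created_at);
CREATE INDEX idx_retweet_user_created ON retweet (user_id, created_at);
CREATE INDEX idx_retweet_tweet        ON retweet (tweet_id);
""")

conn.execute("INSERT INTO users (id) VALUES (1)")
conn.execute("INSERT INTO tweet (id, user_id, value) VALUES (1, 1, 'hello')")
# The surrogate id on retweet lets the same user retweet the same tweet twice.
conn.execute("INSERT INTO retweet (tweet_id, user_id) VALUES (1, 1)")
conn.execute("INSERT INTO retweet (tweet_id, user_id) VALUES (1, 1)")

retweet_count = conn.execute(
    "SELECT COUNT(*) FROM retweet WHERE tweet_id = 1 AND user_id = 1"
).fetchone()[0]


def try_orphan_retweet(connection):
    """Return True if a retweet of a non-existent tweet is accepted."""
    try:
        connection.execute("INSERT INTO retweet (tweet_id, user_id) VALUES (999, 1)")
        return True
    except sqlite3.IntegrityError:
        return False


orphan_accepted = try_orphan_retweet(conn)
```

Because retweet has its own id rather than a composite key on (tweet_id, user_id), repeated retweets by the same person are allowed, while the foreign key rejects a retweet that points at a missing tweet. In MySQL the same constraints would need an engine that enforces foreign keys, such as InnoDB, rather than MyISAM.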
# Sample Rows
mysql> select * from tweet;
+----+---------+----------------+---------------------+
| id | user_id | value          | created_at          |
+----+---------+----------------+---------------------+
|  1 |       1 | User1 | Tweet1 | 2012-07-27 00:04:30 |
|  2 |       1 | User1 | Tweet2 | 2012-07-27 00:04:35 |
|  3 |       2 | User2 | Tweet1 | 2012-07-27 00:04:47 |
|  4 |       3 | User3 | Tweet1 | 2012-07-27 00:04:58 |
|  5 |       1 | User1 | Tweet3 | 2012-07-27 00:06:47 |
|  6 |       1 | User1 | Tweet4 | 2012-07-27 00:06:50 |
|  7 |       1 | User1 | Tweet5 | 2012-07-27 00:06:54 |
+----+---------+----------------+---------------------+
mysql> select * from retweet;
+----+----------+---------+---------------------+
| id | tweet_id | user_id | created_at          |
+----+----------+---------+---------------------+
|  1 |        4 |       1 | 2012-07-27 00:06:37 |
|  2 |        3 |       1 | 2012-07-27 00:07:11 |
+----+----------+---------+---------------------+
# Query to pull all tweets for user_id = 1, including retweets and order from newest to oldest
select * from (
    select t.* from tweet as t where user_id = 1
    union
    select t.* from tweet as t where t.id in (select tweet_id from retweet where user_id = 1)
) a order by created_at desc;
mysql> select * from (select t.* from tweet as t where user_id = 1 union select t.* from tweet as t where t.id in (select tweet_id from retweet where user_id = 1)) a order by created_at desc;
+----+---------+----------------+---------------------+
| id | user_id | value          | created_at          |
+----+---------+----------------+---------------------+
|  7 |       1 | User1 | Tweet5 | 2012-07-27 00:06:54 |
|  6 |       1 | User1 | Tweet4 | 2012-07-27 00:06:50 |
|  5 |       1 | User1 | Tweet3 | 2012-07-27 00:06:47 |
|  4 |       3 | User3 | Tweet1 | 2012-07-27 00:04:58 |
|  3 |       2 | User2 | Tweet1 | 2012-07-27 00:04:47 |
|  2 |       1 | User1 | Tweet2 | 2012-07-27 00:04:35 |
|  1 |       1 | User1 | Tweet1 | 2012-07-27 00:04:30 |
+----+---------+----------------+---------------------+
Notice that the last result set also includes the tweets that user 1 retweeted, with #4 listed before #3 because the rows are ordered by the original tweet's created_at.
-- Update --
You can accomplish what you are asking for (sorting each retweet by the time it was retweeted) by changing the query a bit:
select * from (
    select t.id, t.value, t.created_at from tweet as t where user_id = 1
    union
    select t.id, t.value, rt.created_at from tweet as t inner join retweet as rt on rt.tweet_id = t.id where rt.user_id = 1
) a order by created_at desc;
mysql> select * from (select t.id, t.value, t.created_at from tweet as t where user_id = 1 union select t.id, t.value, rt.created_at from tweet as t inner join retweet as rt on rt.tweet_id = t.id where rt.user_id = 1) a order by created_at desc;
+----+----------------+---------------------+
| id | value          | created_at          |
+----+----------------+---------------------+
|  3 | User2 | Tweet1 | 2012-07-27 00:07:11 |
|  7 | User1 | Tweet5 | 2012-07-27 00:06:54 |
|  6 | User1 | Tweet4 | 2012-07-27 00:06:50 |
|  5 | User1 | Tweet3 | 2012-07-27 00:06:47 |
|  4 | User3 | Tweet1 | 2012-07-27 00:06:37 |
|  2 | User1 | Tweet2 | 2012-07-27 00:04:35 |
|  1 | User1 | Tweet1 | 2012-07-27 00:04:30 |
+----+----------------+---------------------+
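Because the question allows one person to retweet the same tweet several times, it may be worth checking how the timeline query treats repeats. The sketch below loads the sample rows from this answer into an in-memory SQLite database (standing in for MySQL) and adds one extra retweet of tweet #4 by user 1; that extra row, the is_retweet flag and the shown_at alias are assumptions added for illustration. It uses UNION ALL rather than UNION so that no rows are merged by duplicate elimination.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweet   (id INTEGER PRIMARY KEY, user_id INTEGER, value TEXT, created_at TEXT);
CREATE TABLE retweet (id INTEGER PRIMARY KEY, tweet_id INTEGER, user_id INTEGER, created_at TEXT);
""")

# Sample rows from the answer.
conn.executemany("INSERT INTO tweet VALUES (?, ?, ?, ?)", [
    (1, 1, "User1 | Tweet1", "2012-07-27 00:04:30"),
    (2, 1, "User1 | Tweet2", "2012-07-27 00:04:35"),
    (3, 2, "User2 | Tweet1", "2012-07-27 00:04:47"),
    (4, 3, "User3 | Tweet1", "2012-07-27 00:04:58"),
    (5, 1, "User1 | Tweet3", "2012-07-27 00:06:47"),
    (6, 1, "User1 | Tweet4", "2012-07-27 00:06:50"),
    (7, 1, "User1 | Tweet5", "2012-07-27 00:06:54"),
])
conn.executemany("INSERT INTO retweet VALUES (?, ?, ?, ?)", [
    (1, 4, 1, "2012-07-27 00:06:37"),
    (2, 3, 1, "2012-07-27 00:07:11"),
    # Hypothetical extra row: user 1 retweets tweet 4 a second time.
    (3, 4, 1, "2012-07-27 00:08:00"),
])

TIMELINE_SQL = """
SELECT t.id, t.value, t.created_at AS shown_at, 0 AS is_retweet
FROM tweet AS t
WHERE t.user_id = :uid
UNION ALL
SELECT t.id, t.value, rt.created_at AS shown_at, 1 AS is_retweet
FROM tweet AS t
INNER JOIN retweet AS rt ON rt.tweet_id = t.id
WHERE rt.user_id = :uid
ORDER BY shown_at DESC
"""

# ISO-formatted timestamps stored as text sort correctly as strings.
rows = conn.execute(TIMELINE_SQL, {"uid": 1}).fetchall()
timeline_ids = [row[0] for row in rows]
retweet_rows = sum(row[3] for row in rows)
```

With this data, tweet #4 shows up twice in the timeline, once per retweet, and each entry is placed by the time of that retweet. With plain UNION, two retweets that happen to share the same timestamp would collapse into one row.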
I would choose option 2 with a slight modification. The parent_id column in the tweets table should point to the row itself when the tweet is not a retweet. Then querying becomes extremely easy:
SELECT tm.Id, tm.UserId, tc.Text, tm.Created,
       CASE WHEN tm.Id <> tc.Id THEN tc.UserId ELSE NULL END AS OriginalPoster
FROM tweet tm
LEFT JOIN tweet tc ON tm.ParentId = tc.Id
ORDER BY tm.Created DESC;
(tc is the parent table, the one with the content: it has the tweet's text, the original poster's Id, etc.)
The reason for the rule that a non-retweet points to itself is that it then becomes easy to add more joins to the original tweet. You just join a table to tc and don't care whether the row is a retweet or not.
Not only is the query easy, it will also perform much better than option 1, because sorting uses only one physical column, which can be indexed.
The only drawback is that the DB will be a little bit larger.
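To make the self-pointing rule concrete, here is a minimal sketch using Python's built-in sqlite3 module in place of a real server. The helper functions post_tweet and retweet and the sample rows are hypothetical, and reading the CASE column as "the original poster, taken from the parent row" is an interpretation of the query above rather than something the answer spells out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE tweet (
    Id       INTEGER PRIMARY KEY,
    UserId   INTEGER NOT NULL,
    Text     TEXT,                          -- empty for retweets
    Created  TEXT NOT NULL,
    ParentId INTEGER REFERENCES tweet(Id)   -- self-reference
)
""")


def post_tweet(connection, user_id, text, created):
    """Insert an original tweet, then point ParentId at the row itself."""
    cur = connection.execute(
        "INSERT INTO tweet (UserId, Text, Created) VALUES (?, ?, ?)",
        (user_id, text, created),
    )
    connection.execute("UPDATE tweet SET ParentId = Id WHERE Id = ?", (cur.lastrowid,))
    return cur.lastrowid


def retweet(connection, user_id, parent_id, created):
    """Insert a retweet: no text of its own, ParentId points at the original."""
    cur = connection.execute(
        "INSERT INTO tweet (UserId, Text, Created, ParentId) VALUES (?, NULL, ?, ?)",
        (user_id, created, parent_id),
    )
    return cur.lastrowid


first = post_tweet(conn, 1, "first", "2012-07-27 00:04:30")
second = post_tweet(conn, 2, "second", "2012-07-27 00:04:47")
shared = retweet(conn, 1, second, "2012-07-27 00:07:11")

# tm is the timeline row, tc is the row that carries the content.
rows = conn.execute("""
SELECT tm.Id, tm.UserId, tc.Text, tm.Created,
       CASE WHEN tm.Id <> tc.Id THEN tc.UserId END AS OriginalPoster
FROM tweet AS tm
JOIN tweet AS tc ON tm.ParentId = tc.Id
ORDER BY tm.Created DESC
""").fetchall()
```

Every row has a non-empty ParentId under this rule, so an inner join is enough here, and the ordering only touches the Created column of one table. The insert-then-update step in post_tweet is one way to set the self-reference when the id is generated by the database; it costs an extra statement per original tweet.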