Introduction
in my project I'm saving FacebookPages and their like count, as well as the like count per country. I have a table for the FacebookPages, one for the languages, one for the correlation between the facebook page and the language (and counting the likes) and one table which saves this data as a history. What I want to do, is to get the page with the strongest increase in likes over a specific time period.
Data to work with
I'm stripping the non relevant information from the create queries.
Table containing all facebook pages
CREATE TABLE `pages` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`facebook_id` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`facebook_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`facebook_likes` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Example data:
INSERT INTO `facebook_pages` (`id`, `facebook_id`, `facebook_name`, `facebook_likes`)
VALUES
(1, '552825254796051', 'Mesut Özil', 28593755),
(2, '134904013188254', 'Borussia Dortmund', 13213354),
(3, '310111039010406', 'Marco Reus', 12799627);
Table containing all languages
CREATE TABLE `languages` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`language` varchar(5) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Example data
INSERT INTO `languages` (`id`, `language`)
VALUES
(1, 'ID'),
(2, 'TR'),
(3, 'BR');
Table containing the correlation
CREATE TABLE `language_page_likes` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`language_id` int(10) unsigned NOT NULL,
`facebook_page_id` int(10) unsigned NOT NULL,
`likes` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
// Foreign key stuff
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Example data
INSERT INTO `language_page_likes` (`id`, `language_id`, `facebook_page_id`)
VALUES
(1, 1, 1),
(2, 2, 1),
(3, 3, 1),
(47, 3, 2),
(51, 1, 2),
(53, 2, 2),
(92, 3, 3),
(95, 2, 3),
(97, 1, 3);
Table containing the history
CREATE TABLE `language_page_likes_history` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`language_page_likes_id` int(10) unsigned NOT NULL,
`likes` int(11) NOT NULL,
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
// Foreign key stuff
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Example data
INSERT INTO `language_page_likes_history` (`id`, `language_page_likes_id`, `likes`, `created_at`)
VALUES
(1, 1, 3272484, '2015-09-11 08:40:23'),
(132014, 1, 3272827, '2015-09-14 08:31:00'),
(2, 2, 1581361, '2015-09-11 08:40:23'),
(132015, 2, 1580392, '2015-09-14 08:31:00'),
(3, 3, 1467090, '2015-09-11 08:40:23'),
(132016, 3, 1467329, '2015-09-14 08:31:00'),
(47, 47, 828736, '2015-09-11 08:40:23'),
(132060, 47, 828971, '2015-09-14 08:31:00'),
(51, 51, 602747, '2015-09-11 08:40:23'),
(132064, 51, 603071, '2015-09-14 08:31:00'),
(53, 53, 545484, '2015-09-11 08:40:23'),
(132066, 53, 545092, '2015-09-14 08:31:00'),
(92, 92, 916570, '2015-09-11 08:40:24'),
(132105, 92, 917032, '2015-09-14 08:31:01'),
(95, 95, 537382, '2015-09-11 08:40:24'),
(132108, 95, 537395, '2015-09-14 08:31:01'),
(97, 97, 419175, '2015-09-11 08:40:24'),
(132110, 97, 419484, '2015-09-14 08:31:01');
As you can see, I got data for the 14th and 11th September. Now I want to get the site, with the biggest increase in likes. Before I have done it with a column called last_like_count, but the problem is, that I can't be dynamic in the date range. With a "normal" regression function I could be dynamic for every date range.
Solution finding
What I already managed to do, was to build all the relationships that are present
SELECT p.id, p.facebook_name, plh.likes, l.language FROM facebook_pages p
INNER JOIN language_page_likes pl ON pl.facebook_page_id = p.id
INNER JOIN language_page_likes_history plh ON plh.language_page_likes_id = pl.id
INNER JOIN languages l ON l.id = pl.language_id
WHERE pl.language_id = 5 OR pl.language_id = 46 OR pl.language_id = 68
With that query I'm getting every likecount in the history of the system for specific languages. But, how would I build regression analysis into that part?
I already found this link here
Identifying trend with SQL query
but my math and MySQL skills aren't high enough to translate the SQL into MySQL. Any help?
This might be, what you are looking for:
SELECT SUM((X-AVG_X)*(Y-AVG_Y)) / SUM((X-AVG_X)*(X-AVG_X)) AS Slope,
PageId, LanguageId
FROM
(
SELECT Q0.Y,
Q0.X,
Q1.AVG_Y,
Q1.AVG_X,
Q1.PageId,
Q1.LanguageId
FROM (SELECT T0.likes AS Y,
UNIX_TIMESTAMP(T0.created_at) AS X,
T1.facebook_page_id AS PageId,
T1.language_id AS LanguageId
FROM language_page_likes_history T0 INNER JOIN
language_page_likes T1 ON
(T0.language_page_likes_id = T1.id)
WHERE T0.created_at > '2015-09-11 00:00:00' AND
T0.created_at < '2015-09-15 00:00:00') Q0 INNER JOIN
(SELECT AVG(T2.likes) AS AVG_Y,
AVG(UNIX_TIMESTAMP(T2.created_at)) AS AVG_X,
T3.facebook_page_id AS PageId,
T3.language_id AS LanguageId
FROM language_page_likes_history T2 INNER JOIN
language_page_likes T3 ON
(T2.language_page_likes_id = T3.id)
WHERE T2.created_at > '2015-09-11 00:00:00' AND
T2.created_at < '2015-09-15 00:00:00'
GROUP BY T3.facebook_page_id, T3.language_id) Q1
ON (Q0.PageId = Q1.PageId) AND (Q0.LanguageId = Q1.LanguageId)
) Q2
GROUP BY PageId, LanguageId
ORDER BY Slope DESC
It returns the slope of the linear regression per Page and Language. The column Slope represents the number of likes per second. In your sample data the amount of likes decreases for two cases. I don't know why. The output should look like this. The SQL statement is tested and I checked two row calculations manually for correct output.
| Slope | PageId | LanguageId |
|-----------------|--------|------------|
| 0.001786287345 | 3 | 3 |
| 0.001326183029 | 1 | 1 |
| 0.001252720995 | 2 | 1 |
| 0.001194724653 | 3 | 1 |
| 0.000924075055 | 1 | 3 |
| 0.000908609364 | 2 | 3 |
| 0.000050263497 | 3 | 2 |
| -0.001515637747 | 2 | 2 |
| -0.003746563717 | 1 | 2 |
There might be a problem, if there is no data in the tables. So maybe ISNULL-checks have to be added.
When you only want to know the absolute values it is simpler. You can take following statement:
SELECT PageId, LanguageId,
(likes_last_in_period - likes_before_period) AS Likes
FROM
(SELECT T1.facebook_page_id AS PageId,
T1.language_id AS LanguageId,
(SELECT likes
FROM language_page_likes_history
WHERE created_at < '2015-09-12 00:00:00' AND
language_page_likes_id = T1.id
ORDER BY created_at DESC LIMIT 1) likes_before_period,
(SELECT likes
FROM language_page_likes_history
WHERE created_at >= '2015-09-12 00:00:00' AND
language_page_likes_id = T1.id
ORDER BY created_at ASC LIMIT 1) likes_first_in_period,
(SELECT likes
FROM language_page_likes_history
WHERE created_at <= '2015-09-15 00:00:00' AND
language_page_likes_id = T1.id
ORDER BY created_at DESC LIMIT 1) likes_last_in_period,
(SELECT likes
FROM language_page_likes_history
WHERE created_at > '2015-09-15 00:00:00' AND
language_page_likes_id = T1.id
ORDER BY created_at ASC LIMIT 1) likes_after_period
FROM language_page_likes T1) Q0
ORDER BY Likes DESC
which has 4 sub-queries. Only two are needed which you have to choose. I have choosen to use the number of likes just before the period and the last number of likes which is in the period to calculate the difference. The result looks like this:
| PageId | LanguageId | Likes |
|--------|------------|-------|
| 3 | 3 | 462 |
| 1 | 1 | 343 |
| 2 | 1 | 324 |
| 3 | 1 | 309 |
| 1 | 3 | 239 |
| 2 | 3 | 235 |
| 3 | 2 | 13 |
| 2 | 2 | -392 |
| 1 | 2 | -969 |
This was what I could came up with right now. I can't properly test this query, because now I don't have the time to create these table structures in one of the web's sql test pages. But I think even if it doesn't work initially it can point you in the right direction.
select
id,
new_date,
max(increase)
from (
select
dg.id,
dg.date new_date,
dg.sum - (select sum from dg where dg.date = date_format((date_sub(str_to_date(new_date, '%Y-%m-%d') 1 DAY), '%Y-%m-%d') increase
from (
select
language_pages_likes_id id,
date_format(created_at, '%Y%-m$-%d') date,
sum(likes) likes_sum
from
language_page_likes_history lplh
group by
language_page_likes_id,
date_format(created_at, '%Y%-m$-%d')
) day_grouping dg
) calculate_increases
Hope it helps. Later, when I can I will further test and improve this query.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With