Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regression analysis in MySQL

Introduction
in my project I'm saving FacebookPages and their like count, as well as the like count per country. I have a table for the FacebookPages, one for the languages, one for the correlation between the facebook page and the language (and counting the likes) and one table which saves this data as a history. What I want to do, is to get the page with the strongest increase in likes over a specific time period.

Data to work with

I'm stripping the non relevant information from the create queries.

Table containing all facebook pages

CREATE TABLE `pages` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `facebook_id` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `facebook_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `facebook_likes` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Example data:

INSERT INTO `facebook_pages` (`id`, `facebook_id`, `facebook_name`, `facebook_likes`)
VALUES
    (1, '552825254796051', 'Mesut Özil', 28593755),
    (2, '134904013188254', 'Borussia Dortmund', 13213354),
    (3, '310111039010406', 'Marco Reus', 12799627);

Table containing all languages

CREATE TABLE `languages` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `language` varchar(5) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Example data

INSERT INTO `languages` (`id`, `language`)
VALUES
    (1, 'ID'),
    (2, 'TR'),
    (3, 'BR');

Table containing the correlation

CREATE TABLE `language_page_likes` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `language_id` int(10) unsigned NOT NULL,
  `facebook_page_id` int(10) unsigned NOT NULL,
  `likes` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  // Foreign key stuff
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Example data

INSERT INTO `language_page_likes` (`id`, `language_id`, `facebook_page_id`)
VALUES
    (1, 1, 1),
    (2, 2, 1),
    (3, 3, 1),
    (47, 3, 2),
    (51, 1, 2),
    (53, 2, 2),
    (92, 3, 3),
    (95, 2, 3),
    (97, 1, 3);

Table containing the history

CREATE TABLE `language_page_likes_history` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `language_page_likes_id` int(10) unsigned NOT NULL,
  `likes` int(11) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`),
  // Foreign key stuff
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Example data

INSERT INTO `language_page_likes_history` (`id`, `language_page_likes_id`, `likes`, `created_at`)
VALUES
    (1, 1, 3272484, '2015-09-11 08:40:23'),
    (132014, 1, 3272827, '2015-09-14 08:31:00'),
    (2, 2, 1581361, '2015-09-11 08:40:23'),
    (132015, 2, 1580392, '2015-09-14 08:31:00'),
    (3, 3, 1467090, '2015-09-11 08:40:23'),
    (132016, 3, 1467329, '2015-09-14 08:31:00'),
    (47, 47, 828736, '2015-09-11 08:40:23'),
    (132060, 47, 828971, '2015-09-14 08:31:00'),
    (51, 51, 602747, '2015-09-11 08:40:23'),
    (132064, 51, 603071, '2015-09-14 08:31:00'),
    (53, 53, 545484, '2015-09-11 08:40:23'),
    (132066, 53, 545092, '2015-09-14 08:31:00'),
    (92, 92, 916570, '2015-09-11 08:40:24'),
    (132105, 92, 917032, '2015-09-14 08:31:01'),
    (95, 95, 537382, '2015-09-11 08:40:24'),
    (132108, 95, 537395, '2015-09-14 08:31:01'),
    (97, 97, 419175, '2015-09-11 08:40:24'),
    (132110, 97, 419484, '2015-09-14 08:31:01');

As you can see, I got data for the 14th and 11th September. Now I want to get the site, with the biggest increase in likes. Before I have done it with a column called last_like_count, but the problem is, that I can't be dynamic in the date range. With a "normal" regression function I could be dynamic for every date range.

Solution finding
What I already managed to do, was to build all the relationships that are present

SELECT p.id, p.facebook_name, plh.likes, l.language FROM facebook_pages p
INNER JOIN language_page_likes pl ON pl.facebook_page_id = p.id
INNER JOIN language_page_likes_history plh ON plh.language_page_likes_id = pl.id
INNER JOIN languages l ON l.id = pl.language_id
WHERE pl.language_id = 5 OR pl.language_id = 46 OR pl.language_id = 68

With that query I'm getting every likecount in the history of the system for specific languages. But, how would I build regression analysis into that part?

I already found this link here

Identifying trend with SQL query

but my math and MySQL skills aren't high enough to translate the SQL into MySQL. Any help?

like image 836
Musterknabe Avatar asked Sep 14 '15 10:09

Musterknabe


2 Answers

This might be, what you are looking for:

SELECT SUM((X-AVG_X)*(Y-AVG_Y)) / SUM((X-AVG_X)*(X-AVG_X)) AS Slope,
       PageId, LanguageId
FROM
(
SELECT Q0.Y, 
       Q0.X, 
       Q1.AVG_Y,
       Q1.AVG_X,
       Q1.PageId,
       Q1.LanguageId
FROM   (SELECT T0.likes AS Y,
               UNIX_TIMESTAMP(T0.created_at) AS X,
               T1.facebook_page_id AS PageId,
               T1.language_id AS LanguageId
        FROM   language_page_likes_history T0 INNER JOIN
               language_page_likes T1 ON 
               (T0.language_page_likes_id = T1.id)
        WHERE  T0.created_at > '2015-09-11 00:00:00' AND
               T0.created_at < '2015-09-15 00:00:00') Q0 INNER JOIN
       (SELECT AVG(T2.likes) AS AVG_Y,
               AVG(UNIX_TIMESTAMP(T2.created_at)) AS AVG_X,
               T3.facebook_page_id AS PageId,
               T3.language_id AS LanguageId
        FROM   language_page_likes_history T2 INNER JOIN
               language_page_likes T3 ON 
               (T2.language_page_likes_id = T3.id)
        WHERE  T2.created_at > '2015-09-11 00:00:00' AND
               T2.created_at < '2015-09-15 00:00:00'
        GROUP BY T3.facebook_page_id, T3.language_id) Q1
        ON (Q0.PageId = Q1.PageId) AND (Q0.LanguageId = Q1.LanguageId)
) Q2
GROUP BY PageId, LanguageId
ORDER BY Slope DESC

It returns the slope of the linear regression per Page and Language. The column Slope represents the number of likes per second. In your sample data the amount of likes decreases for two cases. I don't know why. The output should look like this. The SQL statement is tested and I checked two row calculations manually for correct output.

|           Slope | PageId | LanguageId |
|-----------------|--------|------------|
|  0.001786287345 |      3 |          3 |
|  0.001326183029 |      1 |          1 |
|  0.001252720995 |      2 |          1 |
|  0.001194724653 |      3 |          1 |
|  0.000924075055 |      1 |          3 |
|  0.000908609364 |      2 |          3 |
|  0.000050263497 |      3 |          2 |
| -0.001515637747 |      2 |          2 |
| -0.003746563717 |      1 |          2 |

There might be a problem, if there is no data in the tables. So maybe ISNULL-checks have to be added.


When you only want to know the absolute values it is simpler. You can take following statement:

SELECT PageId, LanguageId,
       (likes_last_in_period - likes_before_period) AS Likes
FROM
(SELECT T1.facebook_page_id AS PageId,
       T1.language_id AS LanguageId,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at < '2015-09-12 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at DESC LIMIT 1) likes_before_period,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at >= '2015-09-12 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at ASC LIMIT 1) likes_first_in_period,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at <= '2015-09-15 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at DESC LIMIT 1) likes_last_in_period,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at > '2015-09-15 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at ASC LIMIT 1) likes_after_period

        FROM   language_page_likes T1) Q0
ORDER BY Likes DESC

which has 4 sub-queries. Only two are needed which you have to choose. I have choosen to use the number of likes just before the period and the last number of likes which is in the period to calculate the difference. The result looks like this:

| PageId | LanguageId | Likes |
|--------|------------|-------|
|      3 |          3 |   462 |
|      1 |          1 |   343 |
|      2 |          1 |   324 |
|      3 |          1 |   309 |
|      1 |          3 |   239 |
|      2 |          3 |   235 |
|      3 |          2 |    13 |
|      2 |          2 |  -392 |
|      1 |          2 |  -969 |
like image 90
user1027167 Avatar answered Nov 08 '22 11:11

user1027167


This was what I could came up with right now. I can't properly test this query, because now I don't have the time to create these table structures in one of the web's sql test pages. But I think even if it doesn't work initially it can point you in the right direction.

select 
    id, 
    new_date,
    max(increase)
from (
select 
    dg.id, 
    dg.date new_date, 
    dg.sum - (select sum from dg where dg.date = date_format((date_sub(str_to_date(new_date, '%Y-%m-%d') 1 DAY), '%Y-%m-%d') increase
from (
select 
    language_pages_likes_id id,
    date_format(created_at, '%Y%-m$-%d') date,
    sum(likes) likes_sum
from
    language_page_likes_history lplh
group by
    language_page_likes_id,
    date_format(created_at, '%Y%-m$-%d')
) day_grouping dg
) calculate_increases

Hope it helps. Later, when I can I will further test and improve this query.

like image 30
Nelson Teixeira Avatar answered Nov 08 '22 13:11

Nelson Teixeira