Using the Stack Overflow public data dump, I've created three simple tables:
The Questions table has hundreds of thousands of rows with Creation_Date spanning from a year ago to today. Looking over the data, there are two notable trends:
If neither of these factors came in to play, it'd be fairly trivial to estimate traffic for a given tag (or group of tags) over a month:
SELECT YEAR(Q.Creation_Date)
,MONTH(Q.Creation_Date)
,SUM( Q.View_Count / DATEDIFF(m,Q.Creation_Date,GETDATE()) )
FROM Questions Q
JOIN QuestionTags QT
ON Q.Question_Id = QT.Question_Id
WHERE QT.Tag_Name IN ('c#','.net', ... )
GROUP BY YEAR(Q.Creation_Date), MONTH(Q.Creation_Date)
ORDER BY 1,2
But because of aforementioned factors (especially the long tail), I'm not sure how to approximate views. My thoughts are to create a function that, using the long tail formula, will calculates views for a month based on the current number of views and weeks open.
Here's what I came up with to find the tail:
DECLARE @SDTE DATETIME, @EDTE DATETIME
SELECT @SDTE = '2009-01-11' -- after new years holiday
,@EDTE = CAST( MAX([Creation_Date]) AS INT )
FROM [Questions]
SELECT [DaysOpen_Count]
,AVG( [WView_Count] )
FROM
(
SELECT QT.[Tag_Name],
Q.[View_Count],
[DaysOpen_Count] = DATEDIFF(DAY, Q.[Creation_Date], @EDTE),
[WView_Count] = CAST( Q.[View_Count] / ( DATEDIFF(DAY, Q.[Creation_Date], @EDTE) / 7.0 ) AS INT )
FROM [Questions] Q
INNER JOIN [QuestionTags] QT
ON Q.[Question_Id] = QT.[Question_Id]
WHERE [Tag_Name] IN ('c#','.net',...)
AND [Creation_Date] < @EDTE
) Q
GROUP BY [DaysOpen_Count]
ORDER BY 1,2
How should I proceed to create this SQL query?
The end goal is a Stored PRocedure that inputs a CSV-string of tags and spits out the past six month's page views for those tags.
UPDATE After "earning" the tumbleweed badge, I figured it was time for a bounty!
You'll need to consider an exponential Views decay curve, something similar to this - http://en.wikipedia.org/wiki/Exponential_decay
What we need here is the area under the curve upto desired time (in days).
If you do the math, you'll come to a result
Views = V/λ[1 - e^(-λt)]
t is (date created - today's date - 1)
V is the view count we have
λ can be 2ln2/T or 1.4/T
T can be a major life time like 5 days or 7 days. Lets take it 5.
We're making a lot of assumptions here cuz of the dynamic nature of SO. But I'm positive that it yields nice results.
All you have to do now is substitute the appropriate values and get views.
I thought of this method for estimating the tail:
for a list of tags, for each question in these tags give the 1st month after creation_date 80% of View_Count give the 2nd month after creation_date 10% of View_Count split 10% equally between the remaining months until today
of course 80%, 10% is just a choice of mine, they can be calculated more precisely based on real data. Also, the second month 10% may be eliminated. All that logic is in the: CASE WHEN diff ...
part.
you obtain estimated view_count/question/month
then all you have to do is sum view_count per month and if you want a window of time add a condition on the month
I created a stored procedure that can do this, but you have to create first a temporary table #tags(Tag_name) where you put the desired tags.
CREATE PROCEDURE GetTagViews @startDate datetime, @endDate datetime
As
IF exists (SELECT null FROM sysobjects WHERE name = '#months' and type = 'U')
DROP TABLE #MONTHS
CREATE TABLE #MONTHS
(
month datetime
)
DECLARE @currMonth datetime
SELECT @currMonth = MIN(Creation_Date) FROM Questions
-- Populate #MONTHS with all the months from the oldest
-- question creation_date to Today
WHILE @currMonth < getdate()
BEGIN
-- insert date starting at the beginning og the month
INSERT INTO #MONTHS select @currMonth - day(@currMonth) + 1
SELECT @currMonth = dateadd(m, 1, @currMonth) -- advance 1 month
END
SELECT YEAR(month) y, MONTH(month) m, SUM(curr_month_views) Views FROM (
SELECT Q1.month, Q1.diff, round(
CASE WHEN diff = dmin and diff = dmax THEN View_Count
WHEN diff = dmin and diff < dmax THEN 0.8*View_Count
WHEN diff = dmin+1 and diff < dmax THEN 0.1*View_Count
WHEN diff = dmin+1 and diff = dmax THEN 0.2*View_Count
WHEN diff >= dmin+2 THEN 0.1/(dmax - (dmin+2) + 1)*View_Count
ELSE 0
END, 0) curr_month_views
FROM (
SELECT Q.question_id, m.month,
DATEDIFF(m, Q.Creation_Date, m.month) diff,
Q.View_Count, dmin, dmax
FROM Questions Q,
#MONTHS m,
(SELECT MIN(DATEDIFF(m, Q.Creation_Date, m.month)) [dmin],
MAX(DATEDIFF(m, Q.Creation_Date, m.month)) [dmax]
FROM Questions Q,#MONTHS m
WHERE DATEDIFF(m, Q.Creation_Date, m.month) >= 0) MINMAX
) Q1 join QuestionTags QT on Q1.question_id = QT.question_id
join #tags on #tags.Tag_Name = QT.Tag_Name
) b WHERE month >= @startDate - day(@startDate) + 1
AND month <= @enddate - day(@enddate) + 1
GROUP BY Year(month), Month(month)
ORDER BY 1, 2
If I run this procedure with the following data:
Question_Id View_Count Creation_Date tag_name
----------- ----------- ------------------------------ ----------
0 42 2009-09-10 00:00:00.000 sql
1 326 2008-08-04 00:00:00.000 sql
2 377 2008-08-04 00:00:00.000 sql
3 568 2008-08-03 00:00:00.000 sql
4 839 2008-08-01 00:00:00.000 sql
5 228 2009-03-01 00:00:00.000 sql
6 178 2009-03-11 00:00:00.000 sql
7 348 2009-08-11 00:00:00.000 c#
populate #tags with 'sql'
GetTagViews '20090501', '20091001'
y m Views
----------- ----------- ---------------
2009 5 21.000000000000
2009 6 21.000000000000
2009 7 21.000000000000
2009 8 21.000000000000
2009 9 55.000000000000
populate #tags with 'c#'
GetTagViews '20090501', '20091001'
y m Views
----------- ----------- ----------------------------------------
2009 5 .000000000000
2009 6 .000000000000
2009 7 .000000000000
2009 8 278.000000000000
2009 9 35.000000000000
populate #tags with both 'sql' & 'c#'
GetTagViews '20090501', '20091001'
y m Views
----------- ----------- ----------------
2009 5 21.000000000000
2009 6 21.000000000000
2009 7 21.000000000000
2009 8 299.000000000000
2009 9 90.000000000000
(you see that peak for (sql, c#) comparing to only (sql) for 2009-08, it's due to the c# question being asked that month.)
N.B.: the rouding of estimates my lead to a difference of some views (~1) if you sum up detailed views and compare to the original data for a given question!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With