I am trying to predict the trend of an internet post.
I have the number of comments and votes the post has 2 minutes after being posted (this window can change, but it should be enough).
Currently I use this formula:
predicted_votes = (votes_per_minute + n_comments * 60 * h) * k
And then I find k experimentally. I get the post data, wait an hour, and compute:
k = (older_k + actual_votes/predicted_votes) / 2
And so on. This kind of works. The accuracy is pretty low (40 - 50%), but it gives me a rough idea on how the post is going to react.
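For reference, the update rule above can be sketched roughly as follows. This is only one reading of it: it assumes the ratio is taken against the unscaled prediction (k = 1), so that each ratio is itself an estimate of k. The helper names and the observations are hypothetical, made up for illustration.

```python
def base_prediction(votes_per_minute, n_comments, h):
    # The question's formula with k left out (equivalent to k = 1).
    return votes_per_minute + n_comments * 60 * h


def update_k(older_k, actual_votes, predicted_votes):
    # Average the previous estimate of k with the newly observed ratio.
    return (older_k + actual_votes / predicted_votes) / 2


k = 1.0
# Hypothetical observations:
# (votes_per_minute, n_comments, actual votes one hour later)
observations = [(1.5, 0, 30), (1.0, 1, 120), (2.0, 0, 50)]
for vpm, n_comments, actual in observations:
    k = update_k(k, actual, base_prediction(vpm, n_comments, h=1))

print(k)
```

Written this way, the rule is exponential smoothing with weight 1/2: the most recent post contributes half of k, so the estimate swings a lot from post to post.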
I was wondering if I could employ a more complex equation, something like:
predicted_votes = ((votes_per_minute * x + n_comments * y) * 60 * hour) * k # Hour stands for 'how many hours to predict'
And then optimize the parameters to approximate a bit better.
I would assume that I could use Machine Learning, although I don't have a GPU available (that's right, I'm running on integrated graphics, blame Mojave), so I am trying this approach instead.
So the question boils down to: how do I optimize those parameters (k, x, y) to get better accuracy?
EDIT:
I tried following what @Alexis said, and this is where I am at right now:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
initial_votes_list = [1.41, 0.9, 0.94, 0.47, 0]
initial_comment_list = [0, 3, 0, 1, 64]
def func(x, k, t, s):
    votes_per_minute = x[0]
    n_comments = x[1]
    return ((votes_per_minute * t + n_comments * s) * 60) * k
xdata = [1.41,0]
y = func(xdata, 2.5, 1.3, 0.5)
np.random.seed(1729)
ydata = y + 5
plt.plot(xdata, ydata, 'b-', label='data')
popt, pcov = curve_fit(func, xdata, ydata)
plt.plot(xdata, func(xdata, *popt), 'g--',
         label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.xlabel('Time')
plt.ylabel('Score')
plt.legend()
plt.show()
I am not sure how to feed in the data I have (votes_per_minute, n_comments), nor how I could tell the algorithm that the y axis is actually time.
EDIT 2:
Tried doing what @Alexis told me, but I am unsure what to use as actual_score: a number doesn't work, and neither does a list. Also, I want to predict the 'score', not the number of comments.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
initial_votes_list = [1.41, 0.9, 0.94, 0.47, 0]
initial_comment_list = [0, 3, 0, 1, 64]
final_score = [26,12,13,14,229]
def func(x,k,t,s):
    return ((x[0]*k+x[1]*t)*60*x[2])*s
X = [[a,b,c] for a,b,c in zip(initial_votes_list,initial_comment_list,[i for i in range(len(initial_votes_list))])]
y = actual_votes # What is this?
popt, pcov = curve_fit(func, X, y)
plt.plot(xdata, func(xdata, *popt), 'g--',
         label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.xlabel('Time')
plt.ylabel('Score')
plt.legend()
plt.show()
You don't need ML for this (I think it would be overkill here). SciPy provides a nice and easy way to fit a curve to the observations you have.
scipy.optimize.curve_fit lets you fit a function with unknown parameters to your observations. Since you already know the general form of the function, estimating its parameters is a well-known statistics problem, so SciPy should be enough.
We can take a small example to demonstrate this. First, we define the function to fit:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy.optimize import curve_fit
>>>
>>> def func(x, a, b, c):
...     return a * np.exp(-b * x) + c
Define the data to be fit with some noise:
>>> xdata = np.linspace(0, 4, 50)
>>> y = func(xdata, 2.5, 1.3, 0.5)
>>> np.random.seed(1729)
>>> y_noise = 0.2 * np.random.normal(size=xdata.size)
>>> ydata = y + y_noise
>>> plt.plot(xdata, ydata, 'b-', label='data')
Then we fit the function (here a * exp(-b * x) + c) to the data using SciPy:
popt, pcov = curve_fit(func, xdata, ydata)
You could add constraints to this, but for your problem it is not necessary. By the way, this example is at the end of the link I provided. Everything you need to know to use curve_fit is available on that page.
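In case constraints do turn out to be useful, here is a minimal sketch of how they could be passed through curve_fit's bounds argument, reusing the same toy function. The particular limits are arbitrary, chosen only for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def func(x, a, b, c):
    return a * np.exp(-b * x) + c


# Same synthetic data as in the example above.
xdata = np.linspace(0, 4, 50)
np.random.seed(1729)
ydata = func(xdata, 2.5, 1.3, 0.5) + 0.2 * np.random.normal(size=xdata.size)

# bounds=(lower, upper): here 0 <= a <= 3, 0 <= b <= 1, 0 <= c <= 0.5.
# A scalar lower bound applies to every parameter.
popt, pcov = curve_fit(func, xdata, ydata, bounds=(0, [3.0, 1.0, 0.5]))

# One-standard-deviation uncertainty estimates for the fitted parameters.
perr = np.sqrt(np.diag(pcov))
print(popt)
print(perr)
```

Because the upper limit on b (1.0) is below the value used to generate the data (1.3), the fit is forced away from the true parameters, which shows that bounds are enforced even when they make the fit worse.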
Edit
It seems you are having a hard time figuring out how to use this. Let's go slowly and step by step to make sure we agree at every stage:
What you want to predict is the score y. It is known, not calculated.
The inputs are votes_per_minute, n_comments and the hour h.
The parameters to fit are (x, y, k).
So X[i] (one sample) should look like this: [votes_per_minute, n_comments, h]
and with your formula y = ((votes_per_minute * k + n_comments * t) * 60 * h) * s, by replacing the names:
def func(x,k,t,s):
    return ((x[0]*k+x[1]*t)*60*x[2])*s
X = np.array([[a,b,c] for a,b,c in zip(initial_votes_list,initial_comment_list,[i for i in range(len(initial_votes_list))])]).T
y = score
and then:
popt, pcov = curve_fit(func, X, y)
(If I understand your issue correctly. If not, I don't see where the problem is.)
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
initial_votes_list = [1.41, 0.9, 0.94, 0.47, 0]
initial_comment_list = [0, 3, 0, 1, 64]
final_score = [26,12,13,14,229]
def func(x,k,t,s):
    return ((x[0]*k+x[1]*t)*60*x[2])*s
X = np.array([[a,b,c] for a,b,c in zip(initial_votes_list,initial_comment_list,[i for i in range(len(initial_votes_list))])]).T
y = [0.12,0.20,0.5,0.9,1]
popt, pcov = curve_fit(func, X, y)
print(popt)
>>>[-6.65969099e+00 -6.99241803e-02 -9.33412000e-04]
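As a variation, here is a rough sketch that fits against final_score instead, since the score is what the question wants to predict. It makes two assumptions that are not in the code above. First, every final score is assumed to have been recorded 1 hour after posting (the hours array is hypothetical). Second, the model is reduced to two parameters, a and b, because in the three-parameter form only the products k*s and t*s affect the prediction, so the fit cannot pin down k, t and s individually.

```python
import numpy as np
from scipy.optimize import curve_fit

initial_votes_list = [1.41, 0.9, 0.94, 0.47, 0]
initial_comment_list = [0, 3, 0, 1, 64]
final_score = [26, 12, 13, 14, 229]

# Assumption: every final score was recorded 1 hour after posting.
hours = np.ones(len(final_score))


def func(X, a, b):
    # X has shape (3, n_samples): one row per input variable.
    votes_per_minute, n_comments, h = X
    return (votes_per_minute * a + n_comments * b) * 60 * h


X = np.array([initial_votes_list, initial_comment_list, hours], dtype=float)
y = np.array(final_score, dtype=float)

popt, pcov = curve_fit(func, X, y)
predicted = func(X, *popt)

print("a, b:", popt)
print("predicted:", np.round(predicted, 1))
print("actual:   ", y)
```

With only five samples this is a toy fit. The reduced model is linear in a and b, so ordinary least squares (for example np.linalg.lstsq) would give the same answer.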