I've recently been trying out Apache Spark as a replacement for scikit-learn, but it seems to me that even in simple cases, scikit-learn converges to an accurate model far faster than Spark does. For example, I generated 1000 data points for a very simple linear function (z = x + y) with the following script:
from random import random

def func(in_vals):
    '''result = x (+y+z+w....)'''
    result = 0
    for v in in_vals:
        result += v
    return result

if __name__ == "__main__":
    entry_count = 1000
    dim_count = 2
    in_vals = [0] * dim_count
    with open("data_yequalsx.csv", "w") as out_file:
        for entry in range(entry_count):
            for i in range(dim_count):
                in_vals[i] = random()
            out_val = func(in_vals)
            out_file.write(','.join([str(x) for x in in_vals]))
            out_file.write(",%s\n" % str(out_val))
I then ran the following Scikit script:
import sklearn
from sklearn import linear_model
import numpy as np

data = []
target = []
with open("data_yequalsx.csv") as inFile:
    for row in inFile:
        vals = row.split(",")
        data.append([float(x) for x in vals[:-1]])
        target.append(float(vals[-1]))

test_samples = len(data) / 10

train_data = [0] * (len(data) - test_samples)
train_target = [0] * (len(data) - test_samples)
test_data = [0] * test_samples
test_target = [0] * test_samples

train_index = 0
test_index = 0
for j in range(len(data)):
    if j >= test_samples:
        train_data[train_index] = data[j]
        train_target[train_index] = target[j]
        train_index += 1
    else:
        test_data[test_index] = data[j]
        test_target[test_index] = target[j]
        test_index += 1

model = linear_model.SGDRegressor(n_iter=100, learning_rate="invscaling", eta0=0.0001, power_t=0.5, penalty="l2", alpha=0.0001, loss="squared_loss")
model.fit(train_data, train_target)
print(model.coef_)
print(model.intercept_)

result = model.predict(test_data)
mse = np.mean((result - test_target) ** 2)
print("Mean Squared Error = %s" % str(mse))
And then this Spark script (run with spark-submit, no other arguments):
from pyspark.mllib.regression import LinearRegressionWithSGD, LabeledPoint
from pyspark import SparkContext

sc = SparkContext(appName="mllib_simple_accuracy")

# minPartitions doesn't guarantee you get that many partitions, just that you won't have fewer
raw_data = sc.textFile("data_yequalsx.csv", minPartitions=10)
data = raw_data.map(lambda line: [float(x) for x in line.split(",")]) \
               .map(lambda entry: LabeledPoint(entry[-1], entry[:-1])) \
               .zipWithIndex()

test_samples = data.count() / 10
training_data = data.filter(lambda (entry, index): index >= test_samples).map(lambda (lp, index): lp)
test_data = data.filter(lambda (entry, index): index < test_samples).map(lambda (lp, index): lp)

model = LinearRegressionWithSGD.train(training_data, step=0.01, iterations=100, regType="l2", regParam=0.0001, intercept=True)
print(model.weights)
print(model.intercept)

mse = (test_data.map(lambda lp: (lp.label - model.predict(lp.features)) ** 2).reduce(lambda x, y: x + y)) / test_samples
print("Mean Squared Error: %s" % str(mse))

sc.stop()
Strangely, though, the error given by Spark is an order of magnitude larger than that given by scikit-learn (0.185 and 0.045 respectively), despite the two models having a nearly identical setup (as far as I can tell). I understand that this is SGD with very few iterations, and so the results may differ, but I wouldn't have expected anywhere near such a large difference, or such a large error, especially given the exceptionally simple data.
Is there something I'm misunderstanding in Spark? Is it not correctly configured? Surely I should be getting a smaller error than that?
SGD, which stands for Stochastic Gradient Descent, is an online convex optimization algorithm, and is therefore very difficult to parallelize, since it makes one update per iteration (there are smarter variants, such as SGD with mini-batches, but these are still not well suited to a parallel environment).
On the other hand, batch algorithms such as L-BFGS, which I advise you to use with Spark (LogisticRegressionWithLBFGS), can be easily parallelized, since they make one iteration per epoch (they need to see all the data points, calculate the value and gradient of the loss function at each point, and then perform an aggregation to compute the full gradient).
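For reference, here is a minimal sketch of the LBFGS-based MLlib API named above. Note that LogisticRegressionWithLBFGS is a classifier, so this assumes a hypothetical CSV of features with a 0/1 label in the last column, not the regression data from the question:

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="lbfgs_example")

# Assumed layout: feature1,...,featureN,label with label in {0, 1}
points = sc.textFile("classification_data.csv") \
    .map(lambda line: [float(x) for x in line.split(",")]) \
    .map(lambda vals: LabeledPoint(vals[-1], vals[:-1]))

# L-BFGS makes one full pass over the data per iteration and aggregates
# the per-partition gradients, which is why it parallelizes cleanly
model = LogisticRegressionWithLBFGS.train(points, iterations=100, regType="l2", regParam=0.0001)
print(model.weights)
sc.stop()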
The scikit-learn script runs on a single machine, which is why plain SGD performs well there.
By the way, if you look in the MLlib code, the equivalent of scikit-learn's lambda is lambda divided by the size of the dataset: MLlib optimizes

    (1/n) * sum_i l(y_i, f(x_i)) + lambda * R(w)

while scikit-learn optimizes

    sum_i l(y_i, f(x_i)) + lambda * R(w)
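If that reading of the code is right, then matching the scikit-learn run above (alpha = 0.0001 with 900 training points) would mean passing regParam = 0.0001 / 900 ≈ 1.1e-7 to MLlib, rather than passing 0.0001 to both libraries.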
Because Spark is parallelized, each node needs to be able to work independently of the other nodes while the computation is under way, in order to avoid [time-]expensive shuffles between the nodes. Consequently, it uses a procedure called Stochastic Gradient Descent to approach a minimum, which follows local gradients downwards.
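To make that concrete, here is a hedged sketch of one distributed gradient-descent step for squared loss over an RDD of (features, label) pairs. The real MLlib implementation uses treeAggregate and mini-batch sampling, but the shape of the computation is the same: per-partition gradient work, then a single reduce back to the driver:

import numpy as np

def point_gradient(w, x, y):
    # Squared-loss gradient for one point: (w . x - y) * x
    return (np.dot(w, x) - y) * x

def gradient_step(points, w, step, n):
    # points: RDD of (x, y) pairs with x a NumPy array; n = points.count().
    # Each partition computes its per-point gradients independently; only the
    # summed gradient vector crosses node boundaries in the reduce.
    grad = points.map(lambda xy: point_gradient(w, xy[0], xy[1])) \
                 .reduce(lambda a, b: a + b)
    return w - step * grad / n

Only the final reduce involves communication between nodes, which is why a full-batch method with one such aggregation per epoch parallelizes far better than an algorithm making one update per sample.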
The 'exact' way to solve a [simple, least-squares] regression problem involves solving a matrix equation. This is probably what Scikit-Learn is doing, so in this case it will be more accurate.
The trade-off is that solving matrix equations generally scales as N^3 for a size-N square matrix, which rapidly becomes infeasible for large datasets. Spark swaps accuracy for computational power. As with any machine-learning procedure, you should build in lots of sanity checks throughout your algorithms to make sure that the results of each step make sense.
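For comparison, here is a minimal sketch of that exact solve on the same CSV, using NumPy's least-squares routine (np.linalg.lstsq is my choice of solver here, not necessarily what Scikit-Learn calls internally):

import numpy as np

# Last column is the target, the rest are features (same layout as data_yequalsx.csv)
raw = np.loadtxt("data_yequalsx.csv", delimiter=",")
X, y = raw[:, :-1], raw[:, -1]
X = np.hstack([X, np.ones((X.shape[0], 1))])  # column of ones models the intercept

# Exact least-squares solution of the matrix equation X w = y
w, residuals, rank, sv = np.linalg.lstsq(X, y)
print(w)  # should come out close to [1, 1, 0] for z = x + y

On this small, noise-free dataset the recovered coefficients are essentially exact, which is the accuracy gap described above.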
Hope this helps!