Is Apache Spark less accurate than Scikit Learn?

Tags:

I've recently been trying to get to know Apache Spark as a replacement for Scikit Learn, however it seems to me that even in simple cases, Scikit converges to an accurate model far faster than Spark does. For example I generated 1000 data points for a very simple linear function (z=x+y) with the following script:

from random import random

def func(in_vals):
    '''result = x (+y+z+w....)'''
    result = 0
    for v in in_vals:
        result += v
    return result

if __name__ == "__main__":
    entry_count = 1000
    dim_count = 2
    in_vals = [0]*dim_count
    with open("data_yequalsx.csv", "w") as out_file:
        for entry in range(entry_count):
            for i in range(dim_count):
                in_vals[i] = random()
            out_val = func(in_vals)
            out_file.write(','.join([str(x) for x in in_vals]))
            out_file.write(",%s\n" % str(out_val))

I then ran the following Scikit script:

import sklearn
from sklearn import linear_model

import numpy as np

data = []
target = []
with open("data_yequalsx.csv") as inFile:
    for row in inFile:
        vals = row.split(",")
        data.append([float(x) for x in vals[:-1]])
        target.append(float(vals[-1]))

test_samples= len(data)/10

train_data = [0]*(len(data) - test_samples)
train_target = [0]*(len(data) - test_samples)
test_data = [0]*(test_samples)
test_target = [0]*(test_samples)
train_index = 0
test_index = 0
for j in range(len(data)):
    if j >= test_samples:
        train_data[train_index] = data[j]
        train_target[train_index] = target[j]
        train_index += 1
    else:
        test_data[test_index] = data[j]
        test_target[test_index] = target[j]
        test_index += 1

model = linear_model.SGDRegressor(n_iter=100, learning_rate="invscaling", eta0=0.0001, power_t=0.5, penalty="l2", alpha=0.0001, loss="squared_loss")
model.fit(train_data, train_target)
print(model.coef_)
print(model.intercept_)

result = model.predict(test_data)
mse = np.mean((result - test_target) ** 2)
print("Mean Squared Error = %s" % str(mse))

And then this Spark script: (with spark-submit , no other arguments)

from pyspark.mllib.regression import LinearRegressionWithSGD, LabeledPoint
from pyspark import SparkContext

sc = SparkContext (appName="mllib_simple_accuracy")

raw_data = sc.textFile ("data_yequalsx.csv", minPartitions=10) #MinPartitions doesnt guarantee that you get that many partitions, just that you wont have fewer than that many partitions
data = raw_data.map(lambda line: [float(x) for x in line.split (",")]).map(lambda entry: LabeledPoint (entry[-1], entry[:-1])).zipWithIndex()
test_samples= data.count()/10

training_data = data.filter(lambda (entry, index): index >= test_samples).map(lambda (lp,index): lp)
test_data = data.filter(lambda (entry, index): index < test_samples).map(lambda (lp,index): lp)

model = LinearRegressionWithSGD.train(training_data, step=0.01, iterations=100, regType="l2", regParam=0.0001, intercept=True)
print(model._coeff)
print(model._intercept)

mse = (test_data.map(lambda lp: (lp.label - model.predict(lp.features))**2 ).reduce(lambda x,y: x+y))/test_samples;
print("Mean Squared Error: %s" % str(mse))

sc.stop ()

Strangely though, the error given by spark is an order of magnitude larger than that given by Scikit (0.185 and 0.045 respectively) despite the two models having a nearly identical setup (as far as I can tell) I understand that this is using SGD with very few iterations and so the results may differ but I wouldn't have thought that it would be anywhere near such a large difference or such a large error, especially given the exceptionally simple data.

Is there something I'm misunderstanding in Spark? Is it not correctly configured? Surely I should be getting a smaller error than that?

621

asked Jan 21 '15 20:01

JacquesH

2 Answers

SGD, which stands for Stochastic Gradient Descent, is an online convex optimization algorithm, and therefore very difficult to parallelize, since it makes one update per iteration (there are smarter variantes such as SGD with mini-batches, but still not very good for parallel environment.

On the other hand, batch algorithms, such as L-BFGS, hwich I advise you to use with Spark (LogigisticRegressionWithLBFGS), can be easily parallelized, since it makes on iteration per epoch (it need to see all datapoints, calculate the value and gradient of the loss function of each point, then performs the aggregation to calculate the full gradient).

Python is ran in a single machine, therefore SGD performs well.

By the way, if you look in MLlib code, the equivalent of scikit learn's lambda is lambda/size of the dataset (mllib optimizes 1/n*sum(l_i(x_i,f(y_i)) + lambda while scikit learn optimizes sum(l_i(x_i,f(y_i)) + lambda

141

answered Oct 14 '22 05:10

Oussama

Because Spark is parallelized, each node needs to be able to work independently of the other nodes when the computation is under-way to avoid [time-]expensive shuffles between the nodes. Consequently, it uses a procedure called Stochastic Gradient Descent to approach a minimum, which follows local gradients downwards.

The 'exact' way to solve a [simple, least-squares] regression problem involves solving a matrix equation. This is probably what Scikit-Learn is doing, so in this case it will be more accurate.

The trade-off is that solving matrix equations generally scales as N^3 for a size-N square matrix, which rapidly becomes unfeasible for large datasets. Spark swaps accuracy for computational power. As with any machine-learning procedure, you should build in LOTS of sanity checks throughout your algorithms to make sure that the results of the previous step make any sense.

Hope this helps!

answered Oct 14 '22 03:10

StackG

Related questions
                            
                                How to run simple Spark app from Eclipse/Intellij IDE?
                            
                                Working Around Performance & Memory Issues with spark-sql GROUP BY
                            
                                scala.ScalaReflectionException: <none> is not a term
                            
                                Accessing HBase tables through Spark
                            
                                Running Spark on AWS EMR, how to run driver on master node?
                            
                                how can you calculate the size of an apache spark data frame using pyspark?
                            
                                Spark 2.3 submit on Kubernetes error
                            
                                Does Spark lock the File while writing to HDFS or S3
                            
                                Merge Schema with int and double cannot be resolved when reading parquet file
                            
                                How to filter a dataset according to datetime values in Spark
                            
                                Accumulator fails on cluster, works locally
                            
                                Make YARN clean up appcache before retry
                            
                                Build stateful chain for different events and assign global ID in spark
                            
                                Unable to connect Google Storage file using GSC connector from Spark
                            
                                Spark - Serializing an object with a non-serializable member
                            
                                org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 in stage 11.0 failed 4 times
                            
                                BigQuery connector for pyspark via Hadoop Input Format example
                            
                                Spark: Find pairs having at least n common attributes?
                            
                                How to show the spark progress bar in Jupyter notebook (using pyspark)
                            
                                Spark 2.3 Memory Leak on Executor

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Is Apache Spark less accurate than Scikit Learn?

Tags:

machine-learning

apache-spark

scikit-learn

linear-regression

JacquesH

People also ask

2 Answers

Oussama

StackG

Recent Activity

Donate For Us