I'm trying to parallelize a collection with Spark and the example in the documentation doesn't seem to work:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
I'm creating a list of LabeledPoints from records, each of which contains data points (double[]) and a label (defaulted: true/false).
public List<LabeledPoint> createLabeledPoints(List<ESRecord> records) {
    List<LabeledPoint> points = new ArrayList<>();
    for (ESRecord rec : records) {
        points.add(new LabeledPoint(
                rec.defaulted ? 1.0 : 0.0, Vectors.dense(rec.toDataPoints())));
    }
    return points;
}
public void test(List<ESRecord> records) {
    SparkConf conf = new SparkConf().setAppName("SVM Classifier Example");
    SparkContext sc = new SparkContext(conf);
    List<LabeledPoint> points = createLabeledPoints(records);
    JavaRDD<LabeledPoint> data = sc.parallelize(points);
    ...
}
The signature of parallelize no longer takes a single parameter; here is how it looks in spark-mllib_2.11 v1.3.0: sc.parallelize(seq, numSlices, evidence$1)
So any ideas on how to get this working?
In Java, you should use JavaSparkContext instead of SparkContext. The Scala SparkContext.parallelize expects a Scala Seq, a numSlices argument, and an implicit ClassTag (the evidence$1 you're seeing), none of which are convenient to supply from Java; JavaSparkContext.parallelize takes a java.util.List directly and returns a JavaRDD.
https://spark.apache.org/docs/0.6.2/api/core/spark/api/java/JavaSparkContext.html
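As a minimal sketch, here is the test method from the question rewritten against JavaSparkContext (assuming the same ESRecord class and createLabeledPoints helper shown above):

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;

public void test(List<ESRecord> records) {
    SparkConf conf = new SparkConf().setAppName("SVM Classifier Example");

    // JavaSparkContext wraps the Scala SparkContext and exposes
    // Java-friendly overloads; parallelize(List<T>) takes a
    // java.util.List directly and returns a JavaRDD<T>.
    JavaSparkContext sc = new JavaSparkContext(conf);

    List<LabeledPoint> points = createLabeledPoints(records);
    JavaRDD<LabeledPoint> data = sc.parallelize(points);
    // ... train the model on `data` ...

    sc.stop();
}

If you need to control the number of partitions, there is also a two-argument overload, parallelize(list, numSlices), which corresponds to the numSlices parameter you saw in the Scala signature.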