I have a dataframe resulting from a sql query
df1 = sqlContext.sql("select * from table_test")
I need to convert this dataframe to libsvm format so that it can be provided as an input for
pyspark.ml.classification.LogisticRegression
I tried to do the following. However, this resulted in the following error as I'm using spark 1.5.2
df1.write.format("libsvm").save("data/foo")
Failed to load class for data source: libsvm
I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and can't directly pip install it. So I downloaded the file, scp-ed it and then manually installed it. Everything seemed to work fine but I still get the following error
import org.apache.spark.mllib.util.MLUtils
No module named org.apache.spark.mllib.util.MLUtils
Question 1: Is my above approach to convert dataframe to libsvm format in the right direction. Question 2: If "yes" to question 1, how to get MLUtils working. If "no", what is the best way to convert dataframe to libsvm format
MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR . It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ...
libsvm package implements Spark SQL data source API for loading LIBSVM data as DataFrame . The loaded DataFrame has two columns: label containing labels stored as doubles and features containing feature vectors stored as Vector s.
I would act like that (it's just an example with an arbitrary dataframe, I don't know how your df1 is done, focus is on data transformations):
This is my way to convert dataframe to libsvm format:
# ... your previous imports
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
# A DATAFRAME
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 3| 6|
| 4| 5| 20|
| 7| 8| 8|
+---+---+---+
# FROM DATAFRAME TO RDD
>>> c = df.rdd # this command will convert your dataframe in a RDD
>>> print (c.take(3))
[Row(_1=1, _2=3, _3=6), Row(_1=4, _2=5, _3=20), Row(_1=7, _2=8, _3=8)]
# FROM RDD OF TUPLE TO A RDD OF LABELEDPOINT
>>> d = c.map(lambda line: LabeledPoint(line[0],[line[1:]])) # arbitrary mapping, it's just an example
>>> print (d.take(3))
[LabeledPoint(1.0, [3.0,6.0]), LabeledPoint(4.0, [5.0,20.0]), LabeledPoint(7.0, [8.0,8.0])]
# SAVE AS LIBSVM
>>> MLUtils.saveAsLibSVMFile(d, "/your/Path/nameFolder/")
What you will see on the "/your/Path/nameFolder/part-0000*" files is:
1.0 1:3.0 2:6.0
4.0 1:5.0 2:20.0
7.0 1:8.0 2:8.0
See here for LabeledPoint docs
I had to do this for it to work
D.map(lambda line: LabeledPoint(line[0],[line[1],line[2]]))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With