convert dataframe to libsvm format

I have a dataframe resulting from a sql query

df1 = sqlContext.sql("select * from table_test")

I need to convert this dataframe to libsvm format so that it can be provided as an input for

pyspark.ml.classification.LogisticRegression

I tried to do the following. However, this resulted in the following error as I'm using spark 1.5.2

df1.write.format("libsvm").save("data/foo")
Failed to load class for data source: libsvm

I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and can't directly pip install it. So I downloaded the file, scp-ed it and then manually installed it. Everything seemed to work fine but I still get the following error

import org.apache.spark.mllib.util.MLUtils
No module named org.apache.spark.mllib.util.MLUtils

Question 1: Is my above approach to convert dataframe to libsvm format in the right direction. Question 2: If "yes" to question 1, how to get MLUtils working. If "no", what is the best way to convert dataframe to libsvm format

What is LIBSVM format?

MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR . It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ...

What is LIBSVM format in spark?

libsvm package implements Spark SQL data source API for loading LIBSVM data as DataFrame . The loaded DataFrame has two columns: label containing labels stored as doubles and features containing feature vectors stored as Vector s.

I would act like that (it's just an example with an arbitrary dataframe, I don't know how your df1 is done, focus is on data transformations):

This is my way to convert dataframe to libsvm format:

# ... your previous imports

from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint

# A DATAFRAME
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  3|  6|  
|  4|  5| 20|
|  7|  8|  8|
+---+---+---+

# FROM DATAFRAME TO RDD
>>> c = df.rdd # this command will convert your dataframe in a RDD
>>> print (c.take(3))
[Row(_1=1, _2=3, _3=6), Row(_1=4, _2=5, _3=20), Row(_1=7, _2=8, _3=8)]

# FROM RDD OF TUPLE TO A RDD OF LABELEDPOINT
>>> d = c.map(lambda line: LabeledPoint(line[0],[line[1:]])) # arbitrary mapping, it's just an example
>>> print (d.take(3))
[LabeledPoint(1.0, [3.0,6.0]), LabeledPoint(4.0, [5.0,20.0]), LabeledPoint(7.0, [8.0,8.0])]

# SAVE AS LIBSVM
>>> MLUtils.saveAsLibSVMFile(d, "/your/Path/nameFolder/")

What you will see on the "/your/Path/nameFolder/part-0000*" files is:

1.0 1:3.0 2:6.0

4.0 1:5.0 2:20.0

7.0 1:8.0 2:8.0

See here for LabeledPoint docs

I had to do this for it to work

D.map(lambda line: LabeledPoint(line[0],[line[1],line[2]]))

convert dataframe to libsvm format

Tags:

apache-spark

apache-spark-sql

pyspark

spark-dataframe

apache-spark-mllib

sah.stc

People also ask

2 Answers

titiro89

pemfir

Recent Activity

Donate For Us

convert dataframe to libsvm format

Tags:

apache-spark

apache-spark-sql

pyspark

spark-dataframe

apache-spark-mllib

sah.stc

People also ask

2 Answers

titiro89

pemfir

Related questions

Recent Activity

Donate For Us