How to prepare data into a LibSVM format from DataFrame?

Tags:

apache-spark

apache-spark-sql

libsvm

apache-spark-ml

apache-spark-mllib

I want to produce data in libsvm format. I have built a DataFrame in the shape I need, but I do not know how to convert it to libsvm format. The DataFrame is shown in the figure below. The libsvm layout I am hoping for is user item:rating. Does anyone know how to get there from the current state?

import java.io.File

// Parse each "user,item,rating" line into a typed triple.
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
// Key every (item, rating) pair by user, then collect each user's pairs.
val user = ratings.map { case (user, product, rate) => (user, (product, rate)) }
val usergroup = user.groupByKey

// One row per user: parallel arrays of item ids and ratings.
val data = usergroup.map { case (x, iter) => (x, iter.map(_._1).toArray, iter.map(_._2).toArray) }

val data_DF = data.toDF("user", "item", "rating")

[Figure: DATAFRAME FIGURE, https://i.stack.imgur.com/RsZaJ.jpg]

I am using Spark 2.0.

asked Jan 01 '17 14:01

Data diaboli

2 Answers

The issue you are facing can be divided into the following steps:

Converting your ratings (I believe) into LabeledPoint data X.
Saving X in libsvm format.

1. Converting your ratings into LabeledPoint data X

Let's consider the following raw ratings:

val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")

You can handle those raw ratings as a coordinate list (COO) matrix.

Spark implements a distributed matrix backed by an RDD of its entries: CoordinateMatrix, where each entry is a tuple of (i: Long, j: Long, value: Double).

Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse (which is usually the case for user/item ratings).
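To make the COO idea concrete without Spark, here is a minimal plain-Scala sketch. It groups (user, item, rating) triples by user and sorts each user's entries by item, which is conceptually what turning coordinate entries into sparse rows amounts to. The names `Entry`, `parse`, and `groupByUser` are made up for this illustration and are not Spark APIs.

```scala
// Hypothetical stand-in for Spark's MatrixEntry; not a Spark class.
final case class Entry(i: Long, j: Long, value: Double)

// Parse one "user,item,rating" line into a COO entry.
def parse(line: String): Entry = {
  val fields = line.split(",")
  Entry(fields(0).toLong, fields(1).toLong, fields(2).toDouble)
}

// Group entries by row index (user) and sort each row's
// (column, value) pairs by column index (item).
def groupByUser(entries: Seq[Entry]): Map[Long, Seq[(Long, Double)]] =
  entries.groupBy(_.i).map { case (user, es) =>
    user -> es.map(e => (e.j, e.value)).sortBy(_._1)
  }

val rawRatings = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")
val rows = groupByUser(rawRatings.map(parse))
```

Only users that actually have ratings appear as keys, which is why this representation suits very sparse data.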

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val data: RDD[MatrixEntry] = 
      sc.parallelize(rawRatings).map {
            line => {
                  val fields = line.split(",")
                  val i = fields(0).toLong
                  val j = fields(1).toLong
                  val value = fields(2).toDouble
                  MatrixEntry(i, j, value)
            }
      }

Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:

val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
                .toIndexedRowMatrix().rows // Extract indexed rows
                .toDF("label", "features") // Convert rows
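One caveat, offered tentatively: the row index of an IndexedRow is a Long, so the "label" column produced above is a bigint column. As far as I know, the libsvm data source expects the label column to be a double, so a cast may be needed before writing. The sketch below assumes a local SparkSession; the app name and the tiny inline data set are placeholders for illustration only. The vector conversion described in step 2 still applies afterwards.

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("label-cast-sketch").getOrCreate()
import spark.implicits._

// A few COO entries standing in for the parsed ratings.
val entries = spark.sparkContext.parallelize(Seq(
  MatrixEntry(0, 1, 1.0), MatrixEntry(0, 3, 3.0), MatrixEntry(1, 2, 0.0)))

// Same conversion as above: the label column comes from IndexedRow.index (a Long).
val indexed = new CoordinateMatrix(entries)
  .toIndexedRowMatrix().rows
  .toDF("label", "features")

// Cast the label to double so the schema is (double, vector).
val withDoubleLabel = indexed.withColumn("label", col("label").cast("double"))
withDoubleLabel.printSchema()
```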

2. Saving LabeledPoint data in libsvm format

Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg,pos).toDF("label","features")

Unfortunately, we still can't use the DataFrameWriter directly. Most pipeline components support backward compatibility for loading, but DataFrames and pipelines from Spark versions prior to 2.0 that contain vector or matrix columns may need to be migrated to the new spark.ml vector and matrix types.

Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case, we need to do the following (for both the dummy data and the DataFrame from step 1):

import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)

Now let's save the DataFrame:

convertedVecDF.write.format("libsvm").save("data/foo")

And we can check the file contents:

$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
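Note that the feature indices in the output above start at 1, while the vector indices in the code start at 0. The following plain-Scala sketch reproduces that layout by hand, assuming the one-based shift seen in the output; `toLibSvmLine` is a hypothetical helper written for this illustration, not a Spark function.

```scala
// Hypothetical helper: format one (label, sparse features) record as a libsvm line.
// Vector indices are zero-based, libsvm indices are one-based, hence the +1.
def toLibSvmLine(label: Double, indices: Array[Int], values: Array[Double]): String = {
  val features = indices.zip(values).map { case (i, v) => s"${i + 1}:$v" }
  (label.toString +: features).mkString(" ")
}

val negLine = toLibSvmLine(0.0, Array(0, 2), Array(1.0, 3.0))
val posLine = toLibSvmLine(1.0, Array(0, 1, 2), Array(1.0, 0.0, 3.0))
```

For user/item ratings this means an item id j stored at vector index j is written as j + 1 in the file.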

EDIT: In the current version of Spark (2.1.0), there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format as shown below:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg,pos).toDF("label","features")
df.write.format("libsvm").save("data/foo")

answered Oct 20 '22 15:10

eliasah

To convert an existing DataFrame to a typed Dataset, I suggest using the following case class:

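// Assumption: L is an alias for the spark.ml linalg package; the original snippet does not define it.
import org.apache.spark.ml.{linalg => L}
case class LibSvmEntry(
  value: Double,
  features: L.Vector)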

Then you can use the map function to convert each row to a LibSvmEntry, like so:

df.map[LibSvmEntry]((r: Row) => /* Do your stuff here */)
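As one possible way to fill in the map body, here is a hedged sketch meant to be pasted into spark-shell. It assumes a DataFrame shaped like the one in the question (user: Int, item: Array[Int], rating: Array[Double]). The inline sample data and the `numItems` value are assumptions made for this illustration; in practice the vector size has to be at least the largest item id plus one.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Row, SparkSession}

case class LibSvmEntry(value: Double, features: Vector)

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("libsvm-entry-sketch").getOrCreate()
import spark.implicits._

// Sample data in the question's layout: one row per user with parallel arrays.
val dataDF = Seq(
  (0, Array(1, 3), Array(1.0, 3.0)),
  (1, Array(1, 2, 3), Array(1.0, 0.0, 3.0))
).toDF("user", "item", "rating")

// Assumption: item ids are below 4, so a vector of size 4 holds them all.
val numItems = 4

// Use the user id as the label and (item -> rating) pairs as a sparse vector.
val entries = dataDF.map { r: Row =>
  val items = r.getSeq[Int](1)
  val ratings = r.getSeq[Double](2)
  LibSvmEntry(r.getInt(0).toDouble, Vectors.sparse(numItems, items.zip(ratings)))
}
```

The resulting Dataset has a double column followed by a spark.ml vector column, which is the shape the first answer writes out with the libsvm format.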