
How to vectorize DataFrame columns for ML algorithms?

I have a DataFrame with some categorical string values (e.g. uuid|url|browser).

I would like to convert them to doubles so that I can run an ML algorithm that accepts a matrix of doubles.

As the conversion method I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined a function like this:

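import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

def str(arg: String, df: DataFrame): DataFrame = {
  // Fit a StringIndexer on column `arg` and append the result as "<arg>_index"
  val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
  indexer.fit(df).transform(df)
}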

Now the issue is that I would like to iterate over each column of a df, call this function, and add the parsed double column alongside the original string column, so the result would be:

Initial df:

[String: uuid|String: url| String: browser]

Final df:

[String: uuid|Double: uuid_index|String: url|Double: url_index|String: browser|Double: browser_index]

Thanks in advance

asked Sep 02 '15 15:09 by fase_jhn



1 Answer

You can simply foldLeft over the Array of columns:

val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df))

Still, I would argue that it is not a good approach. Since str discards the fitted StringIndexerModel, it cannot be reused when you get new data. Because of that, I would recommend using Pipeline:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
   cname => new StringIndexer()
     .setInputCol(cname)
     .setOutputCol(s"${cname}_index")
)

// Add the rest of your pipeline like VectorAssembler and algorithm
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???

val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(df)
model.transform(df)
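
To illustrate why keeping the fitted model matters, here is a minimal sketch that fits the indexers once and then applies the same fitted PipelineModel to a second DataFrame. The helper names fitIndexers and indexNewData are made up for this example, and it assumes every column of the training DataFrame is a string column:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

// Hypothetical helper: fit one StringIndexer per column and keep the fitted model.
// Assumes all columns of `train` are categorical strings.
def fitIndexers(train: DataFrame): PipelineModel = {
  val stages: Array[PipelineStage] = train.columns.map { cname =>
    val indexer: PipelineStage = new StringIndexer()
      .setInputCol(cname)
      .setOutputCol(s"${cname}_index")
    indexer
  }
  new Pipeline().setStages(stages).fit(train)
}

// Hypothetical helper: apply the already-fitted model to another DataFrame.
// The string-to-double mapping learned from `train` is reused, not re-fitted.
def indexNewData(model: PipelineModel, newDf: DataFrame): DataFrame =
  model.transform(newDf)
```

Note that, by default, a fitted StringIndexer fails on labels it did not see during fitting. Newer Spark releases add a setHandleInvalid option on StringIndexer for that case, which may not be available in 1.4.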

VectorAssembler can be included like this:

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
    .setInputCols(df.columns.map(cname => s"${cname}_index"))
    .setOutputCol("features")

val stages = transformers :+ assembler
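
Putting the pieces together, a rough end-to-end sketch could look like the following. The function name indexAndAssemble is hypothetical, and it assumes all columns of df are categorical strings:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: index every column, then assemble the indexed
// columns into a single vector column named "features".
def indexAndAssemble(df: DataFrame): DataFrame = {
  val inputCols = df.columns

  // One StringIndexer per input column, writing to "<column>_index"
  val indexers: Array[PipelineStage] = inputCols.map { cname =>
    val indexer: PipelineStage = new StringIndexer()
      .setInputCol(cname)
      .setOutputCol(s"${cname}_index")
    indexer
  }

  // Collect all "<column>_index" columns into one vector column
  val assembler = new VectorAssembler()
    .setInputCols(inputCols.map(cname => s"${cname}_index"))
    .setOutputCol("features")

  val stages: Array[PipelineStage] = indexers :+ assembler
  new Pipeline().setStages(stages).fit(df).transform(df)
}
```

The result keeps the original string columns and the per-column index columns, and adds the features column that most spark.ml algorithms expect as input.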

You could also use RFormula, which is less customizable, but much more concise:

import org.apache.spark.ml.feature.RFormula

val rf = new RFormula().setFormula(" ~ uuid + url + browser - 1")
val rfModel = rf.fit(dataset)
rfModel.transform(dataset)
answered Oct 05 '22 16:10 by zero323