have a DataFrame with some categorical string values (e.g uuid|url|browser).
I would to convert it in a double to execute an ML algorithm that accept double matrix.
As convertion method I used StringIndexer (spark 1.4) that map my string values to double values, so I defined a function like this:
def str(arg: String, df:DataFrame) : DataFrame =
(
val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg+"_index")
val newDF = indexer.fit(df).transform(df)
return newDF
)
Now the issue is that i would iterate foreach column of a df, call this function and add (or convert) the original string column in the parsed double column, so the result would be:
Initial df:
[String: uuid|String: url| String: browser]
Final df:
[String: uuid|Double: uuid_index|String: url|Double: url_index|String: browser|Double: Browser_index]
Thanks in advance
Notice the data outputs as numpy array. To add the vectors to the dataframe, use numpy.array ().tolist (). This will save them as a list of lists. Then they can be added as a column to the dataframe. Notice the card2vec column contains the Doc2Vec vectors. Adding the vectors to the dataframe is a convenient way to store them.
That’s what vectorization is for. What is vectorization? Vectorization is jargon for a classic approach of converting input data from its raw format (i.e. text ) into vectors of real numbers which is the format that ML models support.
] The scores are normalized to values between 0 and 1 and the encoded document vectors can then be used directly with most machine learning algorithms. Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at once.. In a vectorized calculation, all elements of the vector (array) can be added in one calculation step.
You can simply foldLeft
over the Array
of columns:
val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df))
Still, I will argue that it is not a good approach. Since src
discards StringIndexerModel
it cannot be used when you get new data. Because of that I would recommend using Pipeline
:
import org.apache.spark.ml.Pipeline
val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
// Add the rest of your pipeline like VectorAssembler and algorithm
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???
val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(df)
model.transform(df)
VectorAssembler
can be included like this:
val assembler = new VectorAssembler()
.setInputCols(df.columns.map(cname => s"${cname}_index"))
.setOutputCol("features")
val stages = transformers :+ assembler
You could also use RFormula
, which is less customizable, but much more concise:
import org.apache.spark.ml.feature.RFormula
val rf = new RFormula().setFormula(" ~ uuid + url + browser - 1")
val rfModel = rf.fit(dataset)
rfModel.transform(dataset)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With