 

How to do label encoding of categorical values in Apache Spark

I have a dataset that contains string columns. How can I encode the string-based columns, the way scikit-learn's LabelEncoder does?

Abhishek Choudhary asked Jun 01 '15 18:06



1 Answer

StringIndexer is what you need. It maps a string column of labels to a column of numeric label indices: https://spark.apache.org/docs/1.5.1/ml-features.html#stringindexer

from pyspark.ml.feature import StringIndexer

# Small example DataFrame with one string column, "category"
df = sqlContext.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

# Fit the indexer on the string column, then append the numeric index column
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()
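
One difference from scikit-learn is worth knowing: LabelEncoder numbers the classes in sorted order, while StringIndexer by default numbers them by descending frequency, so the most common label gets index 0.0. The sketch below imitates that ordering in plain Python, without Spark, for the same six values. The helper name frequency_index is hypothetical, and breaking ties alphabetically is an assumption made here to keep the result deterministic.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to a float index, most frequent label first.

    Hypothetical helper that mimics StringIndexer's default ordering.
    Ties are broken alphabetically (an assumption for determinism).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


categories = ["a", "b", "c", "a", "a", "c"]
mapping = frequency_index(categories)

# Encode the column with the learned mapping
encoded = [mapping[c] for c in categories]

print(mapping)
print(encoded)
```

Here "a" appears three times, "c" twice and "b" once, so "c" ends up with a smaller index than "b", whereas alphabetical ordering would give a=0, b=1, c=2.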
Sergey Makarevich answered Sep 27 '22 17:09
