I have a dataset that contains string columns. How can I encode the string-based columns, the way scikit-learn's LabelEncoder does?
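StringIndexer is what you need: https://spark.apache.org/docs/1.5.1/ml-features.html#stringindexer
It maps a string column to a column of label indices, ordered by label frequency, so the most frequent label gets index 0.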
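from pyspark.ml.feature import StringIndexer
# Assumes an existing sqlContext, as in the Spark 1.x pyspark shell.
df = sqlContext.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
# Map each distinct string in "category" to a numeric index in "categoryIndex".
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
# fit() learns the label-to-index mapping; transform() appends the index column.
indexed = indexer.fit(df).transform(df)
indexed.show()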
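One difference from LabelEncoder is worth knowing: StringIndexer orders indices by label frequency, while LabelEncoder orders them by sorted label value, so the same data can receive different codes. The following is a rough plain-Python sketch of the two orderings, not Spark's or scikit-learn's actual implementation. The helper names frequency_index and alphabetical_index are made up for illustration, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter


def frequency_index(labels):
    """Hypothetical helper: index labels by descending frequency.

    Ties are broken alphabetically here, which is an assumption
    made for this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # StringIndexer writes its indices as doubles, so floats are used here.
    return {label: float(i) for i, label in enumerate(ordered)}


def alphabetical_index(labels):
    """Hypothetical helper: index labels in sorted order, LabelEncoder-style."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


categories = ["a", "b", "c", "a", "a", "c"]
freq_map = frequency_index(categories)
alpha_map = alphabetical_index(categories)

print(freq_map)   # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print(alpha_map)  # {'a': 0, 'b': 1, 'c': 2}
```

Here "c" gets index 1 under the frequency ordering because it occurs more often than "b", whereas the sorted ordering gives "b" index 1.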