I have a PySpark dataframe
+-------+--------------+----+----+
|address|          date|name|food|
+-------+--------------+----+----+
|1111111|20151122045510| Yin|gre |
|1111111|20151122045501| Yin|gre |
|1111111|20151122045500| Yln|gra |
|1111112|20151122065832| Yun|ddd |
|1111113|20160101003221| Yan|fdf |
|1111111|20160703045231| Yin|gre |
|1111114|20150419134543| Yin|fdf |
|1111115|20151123174302| Yen|ddd |
|2111115|      20123192| Yen|gre |
+-------+--------------+----+----+
that I want to transform to use with pyspark.ml. I can use a StringIndexer to convert the name column to a numeric category:
indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
df_ind = indexer.transform(df)
df_ind.show()

+-------+--------------+----+----------+----+
|address|          date|name|name_index|food|
+-------+--------------+----+----------+----+
|1111111|20151122045510| Yin|       0.0|gre |
|1111111|20151122045501| Yin|       0.0|gre |
|1111111|20151122045500| Yln|       2.0|gra |
|1111112|20151122065832| Yun|       4.0|ddd |
|1111113|20160101003221| Yan|       3.0|fdf |
|1111111|20160703045231| Yin|       0.0|gre |
|1111114|20150419134543| Yin|       0.0|fdf |
|1111115|20151123174302| Yen|       1.0|ddd |
|2111115|      20123192| Yen|       1.0|gre |
+-------+--------------+----+----------+----+
How can I transform several columns with StringIndexer (for example, name and food, each with its own StringIndexer) and then use VectorAssembler to generate a feature vector? Or do I have to create a StringIndexer for each column?
** EDIT **: This is not a dupe because I need to do this programmatically for several data frames with different column names. I can't use VectorIndexer or VectorAssembler because the columns are not numerical.
** EDIT 2**: A tentative solution is
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df).transform(df) for column in df.columns ]
where I end up with a list of three dataframes, each identical to the original plus the transformed column. Now I need to join them to form the final dataframe, but that's very inefficient.
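For reference, the join step would look roughly like the sketch below, folding the list of per-column dataframes back together with functools.reduce; every join triggers a shuffle, which is where the inefficiency comes from:

from functools import reduce

# fold the list of dataframes from the list comprehension above back into one,
# joining on the original columns (one shuffle-heavy join per column)
df_joined = reduce(lambda left, right: left.join(right, on=df.columns), indexers)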
The best way that I've found to do it is to combine several StringIndexers in a list and use a Pipeline to execute them all:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column + "_index").fit(df)
            for column in list(set(df.columns) - set(['date']))]
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)
df_r.show()

+-------+--------------+----+----+----------+----------+-------------+
|address|          date|food|name|food_index|name_index|address_index|
+-------+--------------+----+----+----------+----------+-------------+
|1111111|20151122045510| gre| Yin|       0.0|       0.0|          0.0|
|1111111|20151122045501| gra| Yin|       2.0|       0.0|          0.0|
|1111111|20151122045500| gre| Yln|       0.0|       2.0|          0.0|
|1111112|20151122065832| gre| Yun|       0.0|       4.0|          3.0|
|1111113|20160101003221| gre| Yan|       0.0|       3.0|          1.0|
|1111111|20160703045231| gre| Yin|       0.0|       0.0|          0.0|
|1111114|20150419134543| gre| Yin|       0.0|       0.0|          5.0|
|1111115|20151123174302| ddd| Yen|       1.0|       1.0|          2.0|
|2111115|      20123192| ddd| Yen|       1.0|       1.0|          4.0|
+-------+--------------+----+----+----------+----------+-------------+
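If you also want the single feature vector the question asks about, you can append a VectorAssembler stage to the same pipeline. A minimal sketch, assuming the *_index columns are exactly the features you want (the "features" output column name is arbitrary):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

cols = [c for c in df.columns if c != 'date']
indexers = [StringIndexer(inputCol=c, outputCol=c + "_index") for c in cols]
# assemble the indexed columns into a single vector column usable by pyspark.ml estimators
assembler = VectorAssembler(inputCols=[c + "_index" for c in cols], outputCol="features")
pipeline = Pipeline(stages=indexers + [assembler])
df_r = pipeline.fit(df).transform(df)
df_r.select("features").show(truncate=False)

Note that the unfitted StringIndexers can go straight into the Pipeline; pipeline.fit(df) fits every stage in order, so there is no need to call fit on each indexer yourself.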