 

Apply StringIndexer to several columns in a PySpark Dataframe

I have a PySpark dataframe

+-------+--------------+----+----+
|address|          date|name|food|
+-------+--------------+----+----+
|1111111|20151122045510| Yin|gre |
|1111111|20151122045501| Yin|gre |
|1111111|20151122045500| Yln|gra |
|1111112|20151122065832| Yun|ddd |
|1111113|20160101003221| Yan|fdf |
|1111111|20160703045231| Yin|gre |
|1111114|20150419134543| Yin|fdf |
|1111115|20151123174302| Yen|ddd |
|2111115|      20123192| Yen|gre |
+-------+--------------+----+----+

that I want to transform to use with pyspark.ml. I can use a StringIndexer to convert the name column to a numeric category:

indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
df_ind = indexer.transform(df)
df_ind.show()

+-------+--------------+----+----------+----+
|address|          date|name|name_index|food|
+-------+--------------+----+----------+----+
|1111111|20151122045510| Yin|       0.0|gre |
|1111111|20151122045501| Yin|       0.0|gre |
|1111111|20151122045500| Yln|       2.0|gra |
|1111112|20151122065832| Yun|       4.0|ddd |
|1111113|20160101003221| Yan|       3.0|fdf |
|1111111|20160703045231| Yin|       0.0|gre |
|1111114|20150419134543| Yin|       0.0|fdf |
|1111115|20151123174302| Yen|       1.0|ddd |
|2111115|      20123192| Yen|       1.0|gre |
+-------+--------------+----+----------+----+

How can I transform several columns with StringIndexer (for example, name and food, each with its own StringIndexer) and then use VectorAssembler to generate a feature vector? Or do I have to create a StringIndexer for each column?

** EDIT **: This is not a duplicate because I need to do this programmatically for several dataframes with different column names. I can't use VectorIndexer or VectorAssembler directly because the columns are not numerical.

** EDIT 2**: A tentative solution is

indexers = [
    StringIndexer(inputCol=column, outputCol=column + "_index").fit(df).transform(df)
    for column in df.columns
]

which creates a list of three dataframes, each identical to the original plus its indexed column. I would then need to join them to form the final dataframe, but that is very inefficient.
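A minimal sketch of one way to avoid the joins, assuming the same example dataframe: fit the indexers once, then fold their transform calls over the original dataframe so each one adds its index column in place.

from functools import reduce
from pyspark.ml.feature import StringIndexer

# Fit one indexer per categorical column (skipping the numeric 'date' column)
indexers = [
    StringIndexer(inputCol=c, outputCol=c + "_index").fit(df)
    for c in ["address", "name", "food"]
]

# Chain the transforms instead of producing separate dataframes to join
df_indexed = reduce(lambda data, indexer: indexer.transform(data), indexers, df)
df_indexed.show()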

Asked Apr 29 '16 by Ivan
People also ask

How do I apply a function to multiple columns in PySpark?

You can use reduce, for loops, or list comprehensions to apply PySpark functions to multiple columns in a DataFrame. Using iterators to apply the same operation to multiple columns is vital for maintaining a DRY codebase.
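As a hedged illustration (the column names and the upper-casing here are arbitrary choices, not taken from the question), both styles look like this:

from functools import reduce
from pyspark.sql import functions as F

cols_to_clean = ["name", "food"]  # hypothetical list of columns to transform

# for-loop / reduce style: apply the same function to each column in turn
df_clean = reduce(
    lambda data, c: data.withColumn(c, F.upper(F.col(c))),
    cols_to_clean,
    df,
)

# list-comprehension style: rebuild the columns in a single select
df_clean2 = df.select(
    *[F.upper(F.col(c)).alias(c) if c in cols_to_clean else F.col(c) for c in df.columns]
)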

How do you use the Stringindexer in PySpark?

A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0.
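For instance, the fitted model exposes that ordering through its labels attribute, so the most frequent value sits at position 0 (a small sketch against the name column of the sample data):

from pyspark.ml.feature import StringIndexer

model = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
# Labels are sorted by descending frequency, so labels[0] is the most common name
print(model.labels)  # e.g. ['Yin', 'Yen', 'Yln', 'Yan', 'Yun'] for the sample data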

How do you add a suffix to all columns in PySpark?

If you would like to add a prefix or suffix to multiple columns in a PySpark dataframe, you could use a for loop and .withColumnRenamed().
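A minimal sketch (the suffix itself is an arbitrary choice):

# Append a suffix to every column name, one withColumnRenamed call per column
suffix = "_raw"
df_renamed = df
for c in df.columns:
    df_renamed = df_renamed.withColumnRenamed(c, c + suffix)
df_renamed.printSchema()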


1 Answer

The best way I've found to do this is to combine several StringIndexers in a list and use a Pipeline to execute them all:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [
    StringIndexer(inputCol=column, outputCol=column + "_index").fit(df)
    for column in list(set(df.columns) - set(['date']))
]

pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)

df_r.show()

+-------+--------------+----+----+----------+----------+-------------+
|address|          date|food|name|food_index|name_index|address_index|
+-------+--------------+----+----+----------+----------+-------------+
|1111111|20151122045510| gre| Yin|       0.0|       0.0|          0.0|
|1111111|20151122045501| gra| Yin|       2.0|       0.0|          0.0|
|1111111|20151122045500| gre| Yln|       0.0|       2.0|          0.0|
|1111112|20151122065832| gre| Yun|       0.0|       4.0|          3.0|
|1111113|20160101003221| gre| Yan|       0.0|       3.0|          1.0|
|1111111|20160703045231| gre| Yin|       0.0|       0.0|          0.0|
|1111114|20150419134543| gre| Yin|       0.0|       0.0|          5.0|
|1111115|20151123174302| ddd| Yen|       1.0|       1.0|          2.0|
|2111115|      20123192| ddd| Yen|       1.0|       1.0|          4.0|
+-------+--------------+----+----+----------+----------+-------------+
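To get the feature vector the question asks about, the indexed columns can then be combined with a VectorAssembler added as one more pipeline stage. This is a sketch under the assumption that every non-date column should go into the vector; the "features" output name is arbitrary:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

categorical_cols = list(set(df.columns) - set(['date']))

indexers = [
    StringIndexer(inputCol=column, outputCol=column + "_index").fit(df)
    for column in categorical_cols
]

# Collect the generated index columns into a single vector column
assembler = VectorAssembler(
    inputCols=[column + "_index" for column in categorical_cols],
    outputCol="features",
)

pipeline = Pipeline(stages=indexers + [assembler])
df_r = pipeline.fit(df).transform(df)
df_r.select("features").show(truncate=False)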
Answered Sep 23 '22 by Ivan