I have a PySpark dataframe
+-------+--------------+----+----+
|address|          date|name|food|
+-------+--------------+----+----+
|1111111|20151122045510| Yin|gre |
|1111111|20151122045501| Yin|gre |
|1111111|20151122045500| Yln|gra |
|1111112|20151122065832| Yun|ddd |
|1111113|20160101003221| Yan|fdf |
|1111111|20160703045231| Yin|gre |
|1111114|20150419134543| Yin|fdf |
|1111115|20151123174302| Yen|ddd |
|2111115|      20123192| Yen|gre |
+-------+--------------+----+----+
that I want to transform to use with pyspark.ml. I can use a StringIndexer to convert the name column to a numeric category:
indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
df_ind = indexer.transform(df)
df_ind.show()

+-------+--------------+----+----------+----+
|address|          date|name|name_index|food|
+-------+--------------+----+----------+----+
|1111111|20151122045510| Yin|       0.0|gre |
|1111111|20151122045501| Yin|       0.0|gre |
|1111111|20151122045500| Yln|       2.0|gra |
|1111112|20151122065832| Yun|       4.0|ddd |
|1111113|20160101003221| Yan|       3.0|fdf |
|1111111|20160703045231| Yin|       0.0|gre |
|1111114|20150419134543| Yin|       0.0|fdf |
|1111115|20151123174302| Yen|       1.0|ddd |
|2111115|      20123192| Yen|       1.0|gre |
+-------+--------------+----+----------+----+
How can I transform several columns with StringIndexer (for example, name and food, each with its own StringIndexer) and then use VectorAssembler to generate a feature vector? Or do I have to create a StringIndexer for each column?
** EDIT **: This is not a dupe because I need to do this programmatically for several data frames with different column names. I can't use VectorIndexer or VectorAssembler because the columns are not numerical.
** EDIT 2**: A tentative solution is
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df).transform(df) for column in df.columns ]
where I end up with a list of three dataframes, each identical to the original plus the transformed column. Now I need to join them to form the final dataframe, but that's very inefficient.
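For reference, the join step would look roughly like the sketch below, folding the list of per-column dataframes back together with functools.reduce; every join triggers a shuffle, which is where the inefficiency comes from:

from functools import reduce

# fold the list of dataframes from the list comprehension above back into one,
# joining on the original columns (one shuffle-heavy join per column)
df_joined = reduce(lambda left, right: left.join(right, on=df.columns), indexers)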
The best way that I've found to do it is to combine several StringIndexers in a list and use a Pipeline to execute them all:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column + "_index").fit(df)
            for column in list(set(df.columns) - set(['date']))]
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)
df_r.show()

+-------+--------------+----+----+----------+----------+-------------+
|address|          date|food|name|food_index|name_index|address_index|
+-------+--------------+----+----+----------+----------+-------------+
|1111111|20151122045510| gre| Yin|       0.0|       0.0|          0.0|
|1111111|20151122045501| gra| Yin|       2.0|       0.0|          0.0|
|1111111|20151122045500| gre| Yln|       0.0|       2.0|          0.0|
|1111112|20151122065832| gre| Yun|       0.0|       4.0|          3.0|
|1111113|20160101003221| gre| Yan|       0.0|       3.0|          1.0|
|1111111|20160703045231| gre| Yin|       0.0|       0.0|          0.0|
|1111114|20150419134543| gre| Yin|       0.0|       0.0|          5.0|
|1111115|20151123174302| ddd| Yen|       1.0|       1.0|          2.0|
|2111115|      20123192| ddd| Yen|       1.0|       1.0|          4.0|
+-------+--------------+----+----+----------+----------+-------------+
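If you also want the single feature vector the question asks about, you can append a VectorAssembler stage to the same pipeline. A minimal sketch, assuming the *_index columns are exactly the features you want (the "features" output column name is arbitrary):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

cols = [c for c in df.columns if c != 'date']
indexers = [StringIndexer(inputCol=c, outputCol=c + "_index") for c in cols]
# assemble the indexed columns into a single vector column usable by pyspark.ml estimators
assembler = VectorAssembler(inputCols=[c + "_index" for c in cols], outputCol="features")
pipeline = Pipeline(stages=indexers + [assembler])
df_r = pipeline.fit(df).transform(df)
df_r.select("features").show(truncate=False)

Note that the unfitted StringIndexers can go straight into the Pipeline; pipeline.fit(df) fits every stage in order, so there is no need to call fit on each indexer yourself.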