Pyspark Dataframe - Map Strings to Numerics

Tags:

apache-spark

apache-spark-sql

pyspark

pyspark-sql

spark-dataframe

I'm looking for a way to convert a given column of data, in this case strings, into a numeric representation. For example, I have a dataframe with a column of strings with these values:

+------------+
|    level   |
+------------+
|      Medium|
|      Medium|
|      Medium|
|        High|
|      Medium|
|      Medium|
|         Low|
|         Low|
|        High|
|         Low|
|         Low|

And I want to create a new column where these values get converted to:

"High"= 1, "Medium" = 2, "Low" = 3

+------------+
|   level_num|
+------------+
|           2|
|           2|
|           2|
|           1|
|           2|
|           2|
|           3|
|           3|
|           1|
|           3|
|           3|

I've tried defining a function and doing a foreach over the dataframe like so:

def f(x):
    if x == 'Medium':
        return 2
    elif x == "Low":
        return 3
    else:
        return 1

a = df.select("level").rdd.foreach(f)

But this returns a "None" type. Thoughts? Thanks for the help as always!

asked Nov 30 '17 16:11

Brian Behe

1 Answer

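You can certainly do this along the lines you have been trying, but you'll need a map operation instead of foreach. foreach is an action that runs your function on each element purely for its side effects and always returns None, whereas map is a transformation that returns a new RDD holding the function's return values.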

spark.version
# u'2.2.0'

from pyspark.sql import Row
# toy data:
df = spark.createDataFrame([Row("Medium"),
                              Row("High"),
                              Row("High"),
                              Row("Low")
                             ],
                              ["level"])
df.show()
# +------+ 
# | level|
# +------+
# |Medium|
# |  High|
# |  High|
# |   Low|
# +------+

Using your f(x) with these toy data, we get:

df.select("level").rdd.map(lambda x: f(x[0])).collect()
# [2, 1, 1, 3]
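
Why foreach gave None while map gives values can be illustrated without Spark. The following is only a plain-Python sketch of the same semantics: the foreach helper below is a hypothetical stand-in written for this illustration, not Spark's RDD.foreach, and the built-in map plays the role of RDD.map.

```python
def f(x):
    # Same mapping as in the question.
    if x == "Medium":
        return 2
    elif x == "Low":
        return 3
    else:
        return 1


def foreach(func, items):
    """Hypothetical stand-in for RDD.foreach: call func on every item
    and discard whatever it returns."""
    for item in items:
        func(item)
    # No return statement, so the caller receives None.


levels = ["Medium", "High", "High", "Low"]

side_effect_result = foreach(f, levels)  # return values of f are thrown away
mapped_result = list(map(f, levels))     # return values of f are kept

print(side_effect_result)  # None
print(mapped_result)       # [2, 1, 1, 3]
```

In both cases f is called once per element; the only difference is whether its return values are collected.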

And one more map will give you a dataframe:

df.select("level").rdd.map(lambda x: f(x[0])).map(lambda x: Row(x)).toDF(["level_num"]).show()
# +---------+ 
# |level_num|
# +---------+
# |        2|
# |        1|
# |        1| 
# |        3|
# +---------+

But it would be preferable to do it without invoking a temporary intermediate RDD, using the dataframe function when instead of your f(x):

from pyspark.sql.functions import col, when

df.withColumn("level_num", when(col("level")=='Medium', 2).when(col("level")=='Low', 3).otherwise(1)).show()
# +------+---------+ 
# | level|level_num|
# +------+---------+
# |Medium|        2|
# |  High|        1| 
# |  High|        1|
# |   Low|        3| 
# +------+---------+
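
If the list of levels grows, the chained when calls can be built from a dictionary instead of being written out by hand. The following is a minimal sketch, assuming a non-empty mapping; LEVEL_MAP, level_to_num and with_level_num are hypothetical names introduced here for illustration and are not part of PySpark. Unlike f(x) and the otherwise(1) version above, it has no catch-all branch, so a value that is not in the mapping (a typo or a null, say) becomes null rather than silently turning into 1.

```python
LEVEL_MAP = {"High": 1, "Medium": 2, "Low": 3}  # mapping taken from the question


def level_to_num(level):
    """Plain-Python lookup; returns None for values not in LEVEL_MAP."""
    return LEVEL_MAP.get(level)


def with_level_num(df, mapping=LEVEL_MAP, src="level", dst="level_num"):
    """Hypothetical helper: add column dst to df by chaining one when()
    per entry of mapping. Assumes mapping is non-empty."""
    # Imported here so the plain-Python lookup above works without Spark.
    from pyspark.sql.functions import col, when

    items = list(mapping.items())
    first_key, first_value = items[0]
    expr = when(col(src) == first_key, first_value)
    for key, value in items[1:]:
        expr = expr.when(col(src) == key, value)
    # No otherwise(): rows matching none of the keys get null in dst.
    return df.withColumn(dst, expr)
```

With the toy dataframe above, with_level_num(df).show() should add the same level_num column as the explicit when chain. If the catch-all behaviour of f(x) is what you want, append .otherwise(1) to the expression before calling withColumn.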

answered Sep 23 '22 16:09

desertnaut
