I am trying to convert multiple columns of a DataFrame from string to float like this:
df_temp = sc.parallelize([("1", "2", "3.4555"), ("5.6", "6.7", "7.8")]).toDF(("x", "y", "z"))
df_temp.select(*(float(col(c)).alias(c) for c in df_temp.columns)).show()
but I am getting the error:
select() argument after * must be a sequence, not generator
I cannot understand why this error is being thrown.
Method 1: Using DataFrame.withColumn(). DataFrame.withColumn(colName, col) returns a new DataFrame by adding a column or replacing an existing column that has the same name. We use the Column method cast(dataType) to cast a column to a different data type: cast() with DecimalType() as the argument typecasts to decimal, and cast() with FloatType() typecasts to float. withColumn() takes the name of the column you want to convert as its first argument; as the second argument, apply cast() with the target DataType to the column.
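For example, a minimal sketch of this approach on the df_temp from the question, casting just the "x" column:

from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

# Replace the string column "x" with a float version of itself
df_temp = df_temp.withColumn("x", col("x").cast(FloatType()))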
float() is not a Spark function; you need the function cast():
from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
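You can verify the result with printSchema(); for the example DataFrame it prints something like:

df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns)).printSchema()
# root
#  |-- x: float (nullable = true)
#  |-- y: float (nullable = true)
#  |-- z: float (nullable = true)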
If you want to cast only some columns without changing the whole DataFrame, you can do that with the withColumn function:
cols = ["x", "y"]  # example: the subset of columns to cast; adjust to your DataFrame
for col_name in cols:
    df = df.withColumn(col_name, col(col_name).cast('float'))
This casts the type of the columns in the cols list and keeps the other columns as they are.
Note:
The withColumn function replaces or creates a column based on the column name;
if the column name already exists it will be replaced, otherwise it will be created.
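A small illustration of that behaviour (the name z_float is just a hypothetical example):

df = df.withColumn("x", col("x").cast("float"))        # "x" exists, so it is replaced
df = df.withColumn("z_float", col("z").cast("float"))  # "z_float" does not exist, so it is created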
If you want to cast multiple columns to float and keep the other columns the same, you can use a single select statement.
columns_to_cast = ["col1", "col2", "col3"]
df = (
df
.select(
*(c for c in df.columns if c not in columns_to_cast),
*(col(c).cast("float").alias(c) for c in columns_to_cast)
)
)
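Note that this select changes the column order: the columns that are not cast come first, followed by the cast ones. If you need to preserve the original order, a sketch of one way to do it is to cast conditionally while iterating over df.columns:

df = df.select(
    *(col(c).cast("float").alias(c) if c in columns_to_cast else col(c) for c in df.columns)
)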
I saw the withColumn answer, which will work, but since Spark DataFrames are immutable, each withColumn call generates a completely new DataFrame.
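On Spark 3.3+ there is also DataFrame.withColumns(), which takes a dict and applies all the casts in a single call; a minimal sketch, reusing the columns_to_cast list from above:

df = df.withColumns({c: col(c).cast("float") for c in columns_to_cast})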