
Pyspark dataframe convert multiple columns to float

I am trying to convert multiple columns of a dataframe from string to float like this:

df_temp = sc.parallelize([("1", "2", "3.4555"), ("5.6", "6.7", "7.8")]).toDF(("x", "y", "z"))
df_temp.select(*(float(col(c)).alias(c) for c in df_temp.columns)).show()

but I am getting the error

select() argument after * must be a sequence, not generator

I cannot understand why this error is being thrown.

asked Nov 08 '16 by MARK

People also ask

How do I change the DataType of multiple columns in PySpark?

Method 1: Using DataFrame.withColumn(). DataFrame.withColumn(colName, col) returns a new DataFrame by adding a column or replacing an existing column of the same name. Use the Column method cast(dataType) to cast the column to a different data type.

How do you cast to float in PySpark?

To typecast an integer to decimal in PySpark, use the cast() function with DecimalType() as the argument; to typecast an integer to float, use cast() with FloatType() as the argument.

How do I change the DataType of a column in PySpark DataFrame?

withColumn() – Change Column Type. Use withColumn() to convert the data type of a DataFrame column. This function takes the name of the column you want to convert as its first argument; for the second argument, apply the casting method cast() with the target DataType on the column.
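The FloatType/DecimalType distinction mirrors plain Python's float and decimal.Decimal. A minimal stdlib sketch (not PySpark code) of why you might prefer one over the other:

```python
from decimal import Decimal

s = "3.4555"
as_float = float(s)      # binary floating point, like cast(FloatType()): fast, approximate
as_decimal = Decimal(s)  # exact decimal, like cast(DecimalType()): keeps "3.4555" exactly

print(as_float, as_decimal)
```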


3 Answers

float() is not a Spark function; you need the cast() function:

from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
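The error in the question follows from how Python evaluates float(col(c)): the built-in float() tries to convert its argument to a number immediately, while cast() only builds a lazy expression that Spark evaluates later. A pure-Python sketch of that distinction (FakeColumn is a hypothetical stand-in for pyspark.sql.Column, not the real class):

```python
class FakeColumn:
    """Hypothetical stand-in for a pyspark.sql.Column: a symbolic
    expression, not a concrete number, so float() cannot convert it."""
    def __init__(self, name):
        self.name = name

    def cast(self, dtype):
        # Returns a new symbolic expression, evaluated later by the engine
        return FakeColumn(f"CAST({self.name} AS {dtype.upper()})")

c = FakeColumn("x")
try:
    float(c)                 # eager conversion of a symbolic object: fails
except TypeError as e:
    print("float() failed:", e)

print(c.cast("float").name)  # lazy expression: fine
```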
answered Oct 06 '22 by mtoto


If you want to cast some columns without changing the whole data frame, you can do that with the withColumn() function:

from pyspark.sql.functions import col

for col_name in cols:  # cols is the list of column names to cast
    df = df.withColumn(col_name, col(col_name).cast('float'))

This will cast the columns in the cols list and keep the other columns as they are.
Note:
withColumn() replaces or creates a column based on the column name:
if the column name already exists, the column is replaced; otherwise it is created.

answered Oct 06 '22 by Nimer Esam


If you want to cast multiple columns to float and keep other columns the same, you can use a single select statement.

from pyspark.sql.functions import col

columns_to_cast = ["col1", "col2", "col3"]
df_temp = (
    df
    .select(
        *(c for c in df.columns if c not in columns_to_cast),
        *(col(c).cast("float").alias(c) for c in columns_to_cast)
    )
)

I saw the withColumn answer, which will work, but since Spark dataframes are immutable, each withColumn call generates a completely new dataframe.
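The immutability point is easy to demonstrate without Spark: each withColumn-style call returns a brand-new object rather than mutating the old one. FakeDF below is a hypothetical stand-in used only to count object creations, not the PySpark API:

```python
class FakeDF:
    """Hypothetical immutable frame: withColumn returns a new object."""
    created = 0

    def __init__(self, columns):
        self.columns = dict(columns)
        FakeDF.created += 1

    def withColumn(self, name, value):
        new_cols = dict(self.columns)
        new_cols[name] = value
        return FakeDF(new_cols)  # a brand-new frame every call

df = FakeDF({"x": "1", "y": "5.6", "z": "3.4555"})
for name in ["x", "y", "z"]:
    df = df.withColumn(name, float(df.columns[name]))

# One object for the original frame plus one per withColumn call
print(FakeDF.created)  # prints 4
```

A single select over all columns builds one new frame instead of one per column.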

answered Oct 07 '22 by Justin Davis