I am trying to convert multiple columns of a DataFrame from string to float like this:
df_temp = sc.parallelize([("1", "2", "3.4555"), ("5.6", "6.7", "7.8")]).toDF(("x", "y", "z"))
df_temp.select(*(float(col(c)).alias(c) for c in df_temp.columns)).show()
but I am getting the error:
select() argument after * must be a sequence, not generator
I cannot understand why this error is being thrown.
Method 1: Using DataFrame.withColumn(). DataFrame.withColumn(colName, col) returns a new DataFrame by adding a column or replacing an existing column that has the same name. We use the Column method cast(dataType) to cast a column to a different data type: cast() with DecimalType() as the argument typecasts to decimal, and cast() with FloatType() typecasts to float. withColumn() takes the name of the column you want to convert as its first argument; as the second argument, apply cast() with the target DataType to the column.
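For example, a minimal sketch of this approach on the df_temp from the question, casting just the "x" column:

from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

# Replace the string column "x" with a float version of itself
df_temp = df_temp.withColumn("x", col("x").cast(FloatType()))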
float() is not a Spark function; you need the function cast():
from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
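You can verify the result with printSchema(); for the example DataFrame it prints something like:

df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns)).printSchema()
# root
#  |-- x: float (nullable = true)
#  |-- y: float (nullable = true)
#  |-- z: float (nullable = true)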
If you want to cast only some columns without changing the whole DataFrame, you can do that with the withColumn function:
cols = ["x", "y"]  # example: the subset of columns to cast; adjust to your DataFrame
for col_name in cols:
    df = df.withColumn(col_name, col(col_name).cast('float'))
This casts the type of the columns in the cols list and keeps the other columns as they are.
Note:
The withColumn function replaces or creates a column based on the column name;
if the column name already exists it will be replaced, otherwise it will be created.
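A small illustration of that behaviour (the name z_float is just a hypothetical example):

df = df.withColumn("x", col("x").cast("float"))        # "x" exists, so it is replaced
df = df.withColumn("z_float", col("z").cast("float"))  # "z_float" does not exist, so it is created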
If you want to cast multiple columns to float and keep the other columns the same, you can use a single select statement.
columns_to_cast = ["col1", "col2", "col3"]
df = (
df
.select(
*(c for c in df.columns if c not in columns_to_cast),
*(col(c).cast("float").alias(c) for c in columns_to_cast)
)
)
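Note that this select changes the column order: the columns that are not cast come first, followed by the cast ones. If you need to preserve the original order, a sketch of one way to do it is to cast conditionally while iterating over df.columns:

df = df.select(
    *(col(c).cast("float").alias(c) if c in columns_to_cast else col(c) for c in df.columns)
)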
I saw the withColumn answer, which will work, but since Spark DataFrames are immutable, each withColumn call generates a completely new DataFrame.
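On Spark 3.3+ there is also DataFrame.withColumns(), which takes a dict and applies all the casts in a single call; a minimal sketch, reusing the columns_to_cast list from above:

df = df.withColumns({c: col(c).cast("float") for c in columns_to_cast})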