I have a dataframe in PySpark. Some of its numerical columns contain NaN, so when I read the data and check the schema of the dataframe, those columns come out as string type.
How can I change them to int type? I replaced the NaN values with 0 and checked the schema again, but it still shows string type for those columns. I am using the code below:
data_df = sqlContext.read.format("csv").load('data.csv', header=True, inferSchema=True)
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()
My data looks like this: the columns Plays and drafts contain integer values, but because of the NaN values present in these columns, they are treated as string type.
withColumn() – Change Column Type: use withColumn() to convert the data type of a DataFrame column. This function takes the name of the column you want to convert as its first argument, and for the second argument you apply the casting method cast() with the target DataType on the column.
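As a minimal illustration of that pattern (the DataFrame df and its age column are hypothetical; cast() accepts either a DataType object or its string alias):

from pyspark.sql.types import IntegerType
# both lines are equivalent ways to cast "age" from string to integer
df = df.withColumn("age", df["age"].cast(IntegerType()))
df = df.withColumn("age", df["age"].cast("int"))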
You could use cast (as int) after replacing NaN with 0:
data_df = data_df.withColumn("Plays", data_df["Plays"].cast('int'))   # cast using a type-name string
or equivalently with a DataType object:
from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))
You could also run a loop over each column (see the sketch below), but withColumn() as shown above is the simplest way to convert a string column to integer.
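For reference, a minimal sketch of such a loop, assuming the same data_df and that Plays and drafts are the columns to convert:

from pyspark.sql.types import IntegerType
int_cols = ["Plays", "drafts"]   # hypothetical list of columns to cast
for c in int_cols:
    data_df = data_df.withColumn(c, data_df[c].cast(IntegerType()))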
Another way to do it is using StructField, if you have multiple fields that need to be modified.
Ex:
from pyspark.sql.types import StructField, IntegerType, StructType, StringType

# declare the desired type for each field up front
newDF = [StructField('CLICK_FLG', IntegerType(), True),
         StructField('OPEN_FLG', IntegerType(), True),
         StructField('I1_GNDR_CODE', StringType(), True),
         StructField('TRW_INCOME_CD_V4', StringType(), True),
         StructField('ASIAN_CD', IntegerType(), True),
         StructField('I1_INDIV_HHLD_STATUS_CODE', IntegerType(), True)]
finalStruct = StructType(fields=newDF)
df = spark.read.csv('ctor.csv', schema=finalStruct)
Output:
Before:
root
|-- CLICK_FLG: string (nullable = true)
|-- OPEN_FLG: string (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)
After:
root
|-- CLICK_FLG: integer (nullable = true)
|-- OPEN_FLG: integer (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)
This is a slightly longer procedure for casting, but the advantage is that all the required fields can be handled in one place.
Note that if only the required fields are assigned a data type, the resulting dataframe will contain only those changed fields.