Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to modify/transform the column of a dataframe?

I have an instance of pyspark.sql.dataframe.DataFrame created using

dataframe = sqlContext.sql("select * from table").

One column is 'arrival_date' and contains a string.

How do I modify this column so as to only take the first 4 characters from it and throw away the rest?

How would I convert the type of this column from string to date?

In graphlab.SFrame, this would be:

dataframe['column_name'] = dataframe['column_name'].apply(lambda x: x[:4] )

and

dataframe['column_name'] = dataframe['column_name'].str_to_datetime()
like image 742
Semihcan Doken Avatar asked Dec 24 '22 02:12

Semihcan Doken


2 Answers

As stated by Orions, you can't modify a column, but you can override it. Also, you shouldn't need to create an user defined function, as there is a built-in function for extracting substrings:

from pyspark.sql.functions import *
df = df.withColumn("arrival_date", df['arrival_date'].substr(0, 4))

To convert it to date, you can use to_date, as Orions said:

from pyspark.sql.functions import *
df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4)))

However, if you need to specify the format, you should use unix_timestamp:

from pyspark.sql.functions import *
format = 'yyMM'
col = unix_timestamp(df['arrival_date'].substr(0, 4), format).cast('timestamp')
df = df.withColumn("arrival_date", col)

All this can be found in the pyspark documentation.

like image 157
Daniel de Paula Avatar answered Jan 03 '23 04:01

Daniel de Paula


To extract first 4 characters from the arrival_date (StringType) column, create a new_df by using UserDefinedFunction (as you cannot modify the columns: they are immutable):

from pyspark.sql.functions import UserDefinedFunction, to_date

old_df = spark.sql("SELECT * FROM table")
udf = UserDefinedFunction(lambda x: str(x)[:4], StringType())
new_df = old_df.select(*[udf(column).alias('arrival_date') if column == 'arrival_date' else column for column in old_df.columns])

And to covert the arrival_date (StringType) column into DateType column, use the to_date function as show below:

new_df = old_df.select(old_df.other_cols_if_any, to_date(old_df.arrival_date).alias('arrival_date'))

Sources:
https://stackoverflow.com/a/29257220/2873538
https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html

like image 20
Ajeet Shah Avatar answered Jan 03 '23 05:01

Ajeet Shah