I have an instance of pyspark.sql.dataframe.DataFrame
created using
dataframe = sqlContext.sql("select * from table").
One column is 'arrival_date' and contains a string.
How do I modify this column so as to only take the first 4 characters from it and throw away the rest?
How would I convert the type of this column from string to date?
In graphlab.SFrame, this would be:
dataframe['column_name'] = dataframe['column_name'].apply(lambda x: x[:4] )
and
dataframe['column_name'] = dataframe['column_name'].str_to_datetime()
As stated by Orions, you can't modify a column, but you can override it. Also, you shouldn't need to create an user defined function, as there is a built-in function for extracting substrings:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", df['arrival_date'].substr(0, 4))
To convert it to date, you can use to_date
, as Orions said:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4)))
However, if you need to specify the format, you should use unix_timestamp
:
from pyspark.sql.functions import *
format = 'yyMM'
col = unix_timestamp(df['arrival_date'].substr(0, 4), format).cast('timestamp')
df = df.withColumn("arrival_date", col)
All this can be found in the pyspark documentation.
To extract first 4 characters from the arrival_date
(StringType) column, create a new_df
by using UserDefinedFunction
(as you cannot modify the columns: they are immutable):
from pyspark.sql.functions import UserDefinedFunction, to_date
old_df = spark.sql("SELECT * FROM table")
udf = UserDefinedFunction(lambda x: str(x)[:4], StringType())
new_df = old_df.select(*[udf(column).alias('arrival_date') if column == 'arrival_date' else column for column in old_df.columns])
And to covert the arrival_date
(StringType) column into DateType
column, use the to_date
function as show below:
new_df = old_df.select(old_df.other_cols_if_any, to_date(old_df.arrival_date).alias('arrival_date'))
Sources:
https://stackoverflow.com/a/29257220/2873538
https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With