Spark Version: spark-2.0.1-bin-hadoop2.7
Scala: 2.11.8
I am loading a raw CSV into a DataFrame. In the CSV, although a column is supposed to hold dates, the values are written as 20161025 instead of 2016-10-25. The parameter date_format is a list of the column names that need to be converted to yyyy-MM-dd format.
In the following code, I first load the date columns as StringType via the schema, then check whether date_format is non-empty, i.e. whether there are columns that need to be converted from String to Date, and cast each such column using unix_timestamp and to_date. However, in csv_df.show(), the returned rows are all null.
def read_csv(csv_source: String, delimiter: String, is_first_line_header: Boolean,
             schema: StructType, date_format: List[String]): DataFrame = {
  println("|||| Reading CSV Input ||||")
  var csv_df = sqlContext.read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("header", is_first_line_header)
    .option("delimiter", delimiter)
    .load(csv_source)
  println("|||| Successfully read CSV. Number of rows -> " + csv_df.count() + " ||||")
  if (date_format.length > 0) {
    for (i <- 0 until date_format.length) {
      csv_df = csv_df.select(to_date(unix_timestamp(
        csv_df(date_format(i)), "yyyy-MM-dd").cast("timestamp")))
      csv_df.show()
    }
  }
  csv_df
}
Returned Top 20 rows:
+-------------------------------------------------------------------------+
|to_date(CAST(unix_timestamp(prom_price_date, yyyy-MM-dd) AS TIMESTAMP))|
+-------------------------------------------------------------------------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-------------------------------------------------------------------------+
Why am I getting all null?
The pattern passed to unix_timestamp has to match the raw data: unix_timestamp returns null for any string that does not parse with the supplied format, and here the values are in yyyyMMdd while the code parses them with yyyy-MM-dd, so every row comes back null. To convert yyyyMMdd to yyyy-MM-dd you can:
spark.sql("""SELECT DATE_FORMAT(
CAST(UNIX_TIMESTAMP('20161025', 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'
)""")
The same conversion with the DataFrame functions API:
date_format(unix_timestamp(col, "yyyyMMdd").cast("timestamp"), "yyyy-MM-dd")
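Applied to the loop in the question, a minimal sketch of the fix (assuming the raw values really are yyyyMMdd strings; convert_dates is a hypothetical helper, not part of the original code). It uses withColumn so each date column is replaced in place, whereas the original select(...) discarded every other column:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{to_date, unix_timestamp}

def convert_dates(df: DataFrame, date_format: List[String]): DataFrame = {
  var csv_df = df
  for (col_name <- date_format) {
    // Parse with the pattern the raw data is actually in (yyyyMMdd);
    // unix_timestamp yields null for any non-matching string.
    csv_df = csv_df.withColumn(
      col_name,
      to_date(unix_timestamp(csv_df(col_name), "yyyyMMdd").cast("timestamp")))
  }
  csv_df
}

The result of to_date is a DateType column, which Spark renders as yyyy-MM-dd; if you need a formatted string instead, use the date_format expression shown above.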